FROM: Miyuki Miyasaka, [105721,2676]
TO: OLIFF, [74750,3471]
DATE: 10/10/2001 12:23 AM

Re: E006299US00

*** File Message ***

The size of the file is 168886 bytes.

The file will be stored in C:\CSERVE\DOWNLOAD\F00629~1.RTF

Additional Information:


[Name of Document] SPECIFICATION 

[Title of the Invention] SPEECH RECOGNITION METHOD, STORAGE MEDIUM
STORING SPEECH RECOGNITION PROGRAM, AND SPEECH RECOGNITION
APPARATUS

[Claims] 

[Claim 1] A speech recognition method, wherein speech data on which 
different types of noise have been superposed respectively are created, 
the noise is eliminated by a predetermined noise elimination method from 
each of the speech data on which the noise has been superposed, and 
acoustic models corresponding to each of the noise types are created and 
stored using feature vectors of each of the speech data which have 
undergone the noise elimination; 

and when a speech recognition is performed, 

the type of a noise superposed on speech data to be recognized is 
determined, a corresponding acoustic model is selected from the acoustic 
models corresponding to each of the noise types based on the result of 
the determination, the noise is eliminated by the predetermined noise 
elimination method from the speech data to be recognized on which the 
noise has been superposed, and a speech recognition is performed on the 
feature vector of the speech data which has undergone the noise 
elimination based on the selected acoustic model. 
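
The selection step of Claim 1 can be pictured with a minimal sketch. The claim does not state how the noise type is determined, so the nearest-reference rule, the function names, the noise labels, and the feature values below are all illustrative assumptions rather than the claimed method.

```python
import math

# Hypothetical reference noise features, one per noise type (illustrative values).
REFERENCE_NOISE_FEATURES = {
    "car": [0.9, 0.4, 0.1],
    "babble": [0.3, 0.6, 0.5],
    "fan": [0.5, 0.5, 0.5],
}

# Placeholder acoustic models keyed by noise type. A real system would hold
# models trained on noise-eliminated speech for each noise type.
ACOUSTIC_MODELS = {name: f"acoustic_model_{name}" for name in REFERENCE_NOISE_FEATURES}


def determine_noise_type(noise_feature, references=REFERENCE_NOISE_FEATURES):
    """Return the noise type whose reference vector is nearest (Euclidean distance)."""
    return min(references, key=lambda name: math.dist(noise_feature, references[name]))


def select_acoustic_model(noise_feature):
    """Pick the stored acoustic model matching the determined noise type."""
    return ACOUSTIC_MODELS[determine_noise_type(noise_feature)]
```

The same noise elimination method would then be applied to the input before recognition with the selected model, mirroring how the models were trained.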

[Claim 2] A speech recognition method according to Claim 1, wherein 
the noise elimination method is a spectral subtraction method or a 
continuous spectral subtraction method, and the acoustic models are 
created by eliminating the noise by the spectral subtraction method or
the continuous spectral subtraction method from each of the speech data
on which the different types of noise have been superposed, obtaining 
the feature vectors of each of the speech data which have undergone the 
noise elimination, and using the feature vectors; 
and when a speech recognition is performed, 

a first speech feature analysis is performed to obtain
frequency-domain feature data of the speech data on which the noise has been
superposed; 

it is determined whether the speech data is a noise segment or a
speech segment based on the result of the feature analysis, and when a 
noise segment is detected, the feature data thereof is stored, whereas 
when a speech segment is detected, the type of the noise superposed is 
determined based on the feature data having been stored and a 
corresponding acoustic model is selected from the acoustic models 
corresponding to each of the noise types based on the result of the 
determination; 

the noise is eliminated by the spectral subtraction method or the 
continuous spectral subtraction method from the speech data to be 
recognized on which the noise has been superposed; and 

a second feature analysis is performed on the speech data which has 
undergone the noise elimination to obtain feature data required in the 
speech recognition, and a speech recognition is performed on the result 
of the feature analysis based on the selected acoustic model. 
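
As a rough illustration of the spectral subtraction named in Claim 2, the sketch below averages the stored noise-segment spectra and subtracts that estimate from a speech-segment frame. It covers plain spectral subtraction only, not the continuous variant. The function names, the scaling factor `alpha`, and the spectral floor `floor` are assumptions introduced for the example.

```python
def average_spectrum(frames):
    """Average the per-bin power over the stored noise-segment frames."""
    count = len(frames)
    return [sum(frame[k] for frame in frames) / count for k in range(len(frames[0]))]


def spectral_subtraction(frame, noise_spectrum, alpha=1.0, floor=0.01):
    """Subtract the noise estimate from one frame, bin by bin.

    alpha scales the noise estimate; floor keeps a small fraction of the
    original bin so that no bin becomes negative (both are assumed parameters).
    """
    return [max(s - alpha * n, floor * s) for s, n in zip(frame, noise_spectrum)]


if __name__ == "__main__":
    noise = average_spectrum([[1.0, 2.0], [3.0, 2.0]])
    print(spectral_subtraction([5.0, 1.0], noise))
```

The second feature analysis of the claim would then run on the subtracted spectrum to produce the features used for recognition.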

[Claim 3] A speech recognition method according to Claim 1, wherein 
the noise elimination method is a cepstrum mean normalization method,
and the acoustic models are created by eliminating the noise by the
cepstrum mean normalization method from each of the speech data on which 
the different types of noise have been superposed and using the feature 
vectors of the speech data obtained thereby; 

and when a speech recognition is performed, 

a first speech feature analysis is performed on the speech data to 
be recognized on which the noise has been superposed to obtain a feature 
vector representing cepstrum coefficients; 

it is determined whether the speech data is a noise segment or a
speech segment based on the result of the feature analysis, and when a 
noise segment is detected, the feature vector thereof is stored, and 
when a speech segment is detected, the feature data of the speech 
segment from the beginning through the end thereof is stored, the type 
of the noise superposed is determined based on the feature vector of the 
noise segment having been stored, and an acoustic model is selected from 
the acoustic models corresponding to each of the noise types based on 
the result of the determination; 

the noise is eliminated by the cepstrum mean normalization method 
from the speech segment on which the noise has been superposed using the 
feature vector of the speech segment having been stored, and 

a speech recognition is performed on the feature vector after the 
noise elimination based on the selected acoustic model. 
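
Cepstrum mean normalization, as used in Claim 3, is commonly understood as subtracting the mean cepstrum vector of a segment from every frame of that segment. The sketch below assumes the feature vectors of the whole speech segment have already been stored, which is why the claim stores the segment from its beginning through its end. The function name and data layout are assumptions.

```python
def cepstrum_mean_normalization(frames):
    """Subtract the per-coefficient mean over the segment from each frame.

    frames is a list of cepstrum coefficient vectors covering one speech
    segment from its beginning through its end.
    """
    count = len(frames)
    dim = len(frames[0])
    mean = [sum(frame[k] for frame in frames) / count for k in range(dim)]
    return [[frame[k] - mean[k] for k in range(dim)] for frame in frames]


if __name__ == "__main__":
    print(cepstrum_mean_normalization([[1.0, 4.0], [3.0, 8.0]]))
```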

[Claim 4] A speech recognition method according to one of Claims 1 to 
3, wherein the acoustic models correspond to not only types of noise but 
also a plurality of S/N ratios for each of the noise types, and the
acoustic models corresponding to the plurality of S/N ratios for each of
the noise types are created by generating speech data on which noises 
with the plurality of S/N ratios for each of the noise types have been 
respectively superposed, eliminating the noises from each of the speech 
data by a predetermined noise elimination method, and using the feature 
vectors of each of the speech data which have undergone the noise 
elimination. 
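
One way to generate the training data of Claim 4 is to scale a noise recording so that the speech-to-noise power ratio reaches a target value before adding it to clean speech. The sketch below is only an assumption about how such superposition could be done; the function names and the use of mean power in decibels are not taken from the specification.

```python
import math


def mean_power(samples):
    """Mean squared amplitude of a sample sequence."""
    return sum(x * x for x in samples) / len(samples)


def superpose_noise(speech, noise, snr_db):
    """Scale noise to the target S/N ratio (in dB) and add it to the speech.

    noise is assumed to be at least as long as speech.
    """
    gain = math.sqrt(mean_power(speech) / (mean_power(noise) * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]
```

Repeating this for each noise type and each S/N ratio, then applying the chosen noise elimination method, would yield one set of feature vectors per (noise type, S/N ratio) pair.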

[Claim 5] A speech recognition method according to Claim 4, wherein 
when the acoustic models corresponding to the plurality of S/N ratios 
for each of the noise types are created, in addition to determining the 
type of the noise superposed on the speech data to be recognized, the 
S/N ratio is obtained from the magnitude of the noise in the noise 
segment and the magnitude of the speech in the speech segment, and an 
acoustic model is selected based on the noise type determined and the 
S/N ratio obtained. 
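
The S/N-based selection of Claim 5 can be sketched as follows. The claim says only that the S/N ratio is obtained from the magnitudes of the noise and speech segments; taking the ratio of mean powers in decibels and choosing the model trained at the nearest S/N ratio are assumptions, as are the names and the model table.

```python
import math


def mean_power(samples):
    """Mean squared amplitude of a sample sequence."""
    return sum(x * x for x in samples) / len(samples)


def estimate_snr_db(noise_segment, speech_segment):
    """Estimate the S/N ratio in dB from a noise segment and a speech segment."""
    return 10.0 * math.log10(mean_power(speech_segment) / mean_power(noise_segment))


def select_model(noise_type, snr_db, models):
    """Select the model for noise_type trained at the S/N ratio nearest snr_db.

    models maps (noise_type, trained_snr_db) to an acoustic model.
    """
    candidates = [snr for (kind, snr) in models if kind == noise_type]
    nearest = min(candidates, key=lambda snr: abs(snr - snr_db))
    return models[(noise_type, nearest)]
```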

[Claim 6] A speech recognition method, wherein speech data on which 
different types of noise have been superposed respectively are created, 
the noise is eliminated by a spectral subtraction method or a continuous 
spectral subtraction method from each of the speech data on which the 
different types of noise have been superposed, a cepstrum mean 
normalization method is applied to each of the speech data which have 
undergone the noise elimination to obtain feature vectors of a speech 
segment, and acoustic models corresponding to each of the noise types 
are created and stored based on the feature vectors; 
and when a speech recognition is performed,

a first speech feature analysis is performed to obtain
frequency-domain feature data of speech data to be recognized;

it is determined whether the speech data is a noise segment or a 
speech segment based on the result of the feature analysis, and when a 
noise segment is detected, the feature vector thereof is stored; 

and when a speech segment is detected, the noise is eliminated from 
the speech segment by the spectral subtraction method or the continuous 
spectral subtraction method; 

a second speech feature analysis is performed on the speech segment 
data which has undergone the noise elimination to obtain cepstrum 
coefficients, and the feature vector of the speech segment is stored; 

when the speech segment has terminated, the type of the noise 
superposed is determined based on the feature data of the noise segment 
having been stored, and an acoustic model is selected from the acoustic 
models corresponding to each of the noise types; 

the cepstrum mean normalization method is applied to the feature 
vector of the speech segment on which the noise has been superposed, 
using the feature vector of the speech segment having been stored, to 
obtain the feature vector of the speech segment; and 

a speech recognition is performed on the feature vector obtained by 
the cepstrum mean normalization method based on the selected acoustic 
model.
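
The noise segment / speech segment determination that opens the recognition flow of Claim 6 is not specified in detail. A minimal sketch, assuming a simple per-frame energy threshold on the frequency-domain features (the threshold, the function name, and the frame layout are hypothetical), might look like this:

```python
def split_noise_and_speech(frames, energy_threshold):
    """Classify each frame as noise or speech by its summed spectral energy.

    Frames below the threshold are stored as noise-segment features; the
    rest are treated as the speech segment and passed on for noise
    elimination.
    """
    noise_frames, speech_frames = [], []
    for frame in frames:
        if sum(frame) >= energy_threshold:
            speech_frames.append(frame)
        else:
            noise_frames.append(frame)
    return noise_frames, speech_frames
```

The stored noise frames would then serve both the noise type determination and the spectral subtraction applied to the speech frames.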

[Claim 7] A speech recognition method according to Claim 6, wherein 
the acoustic models correspond to not only types of noise but also a 
plurality of S/N ratios for each of the noise types, and the acoustic
models corresponding to the plurality of S/N ratios for each of the
noise types are created by generating speech data on which noises with 
the plurality of S/N ratios for each of the noise types have been 
respectively superposed, eliminating the noises from each of the speech 
data by the spectral subtraction method or the continuous spectral 
subtraction method, and using the feature vectors of each of the speech 
data obtained by applying the cepstrum mean normalization method to each 
of the speech data which have undergone the noise elimination. 

[Claim 8] A speech recognition method according to Claim 7, wherein
when the acoustic models corresponding to the plurality of S/N ratios 
for each of the noise types are created, in addition to determining the 
type of the noise superposed on the speech data to be recognized, the 
S/N ratio is obtained from the magnitude of the noise in the noise 
segment and the magnitude of the speech in the speech segment, and an 
acoustic model is selected based on the noise type determined and the 
S/N ratio obtained. 

[Claim 9] A speech recognition method, wherein speech data on which a 
particular type of noise with different S/N ratios have been superposed 
respectively are created, the noise is eliminated by a predetermined 
noise elimination method from each of the speech data, and acoustic 
models corresponding to each of the S/N ratios are created and stored 
using feature vectors of each of the speech data which have undergone 
the noise elimination; 

and when a speech recognition is performed, 

the S/N ratio of a noise superposed on speech data to be recognized
is determined, a corresponding acoustic model is selected from the
acoustic models corresponding to each of the S/N ratios based on the 
result of the determination, the noise is eliminated by the 
predetermined noise elimination method from the speech data to be 
recognized on which the noise has been superposed, and a speech 
recognition is performed on the feature vector of the speech data which 
has undergone the noise elimination based on the selected acoustic 
model.

[Claim 10] A speech recognition method according to Claim 9, wherein
the noise elimination method is a spectral subtraction method or a 
continuous spectral subtraction method. 

[Claim 11] A speech recognition method according to Claim 9, wherein 
the noise elimination method is a cepstrum mean normalization method. 

[Claim 12] A storage medium storing a speech recognition program 
comprising:

the step of creating speech data on which different types of noise 
have been superposed respectively, eliminating the noise by a
predetermined noise elimination method from each of the speech data on 
which the noise has been superposed, and creating acoustic models 
corresponding to each of the noise types using the feature vectors 
obtained by analyzing the features of each of the speech data which have 
undergone the noise elimination, and storing the acoustic models in 
acoustic model storage means; 

the step of determining the type of a noise superposed on speech 
data to be recognized, and selecting a corresponding acoustic model from
the acoustic models stored in said acoustic model storage means based on
the result of the determination; 

the step of eliminating the noise by the predetermined noise 
elimination method from the speech data to be recognized on which the 
noise has been superposed; and 

the step of performing a speech recognition on the feature vector 
of the speech data which has undergone the noise elimination based on 
the selected acoustic model. 

[Claim 13] A storage medium storing a speech recognition program
according to Claim 12, wherein the noise elimination method is a 
spectral subtraction method or a continuous spectral subtraction method, 
and the acoustic models are created by eliminating the noise by the 
spectral subtraction method or the continuous spectral subtraction 
method from each of the speech data on which the different types of 
noise have been superposed, obtaining the feature vectors of each of the 
speech data which have undergone the noise elimination, and using the 
feature vectors; 

and the processes for a speech recognition comprise:

the step of performing a first speech feature analysis to obtain 
frequency-domain feature data of the speech data on which the noise has 
been superposed; 

the step of determining whether the speech data is a noise segment 
or a speech segment based on the result of the feature analysis, storing 
the feature data thereof when a noise segment is detected, whereas when 
a speech segment is detected, determining the type of the noise
superposed based on the feature data having been stored and selecting a
corresponding acoustic model from the acoustic models corresponding to 
each of the noise types based on the result of the determination; 

the step of eliminating the noise by the spectral subtraction 
method or the continuous spectral subtraction method from the speech 
data to be recognized on which the noise has been superposed; and 

the step of performing a second feature analysis on the speech data
which has undergone the noise elimination to obtain feature data
required in the speech recognition, and performing a speech recognition
on the result of the feature analysis based on the selected acoustic
model.

[Claim 14] A storage medium storing a speech recognition program 
according to Claim 12, wherein the noise elimination method is a
cepstrum mean normalization method, and the acoustic models are created
by eliminating the noise by the cepstrum mean normalization method from
each of the speech data on which the different types of noise have been
superposed and using the feature vectors of the speech data obtained
thereby; 

and the processes for a speech recognition comprise:
the step of performing a first speech feature analysis on the 
speech data to be recognized on which the noise has been superposed to 
obtain a feature vector representing cepstrum coefficients; 

the step of determining whether the speech data is a noise segment 
or a speech segment based on the result of the feature analysis, and 
storing the feature vector thereof when a noise segment is detected,
whereas when a speech segment is detected, storing the feature data of
the speech segment from the beginning through the end thereof, 
determining the type of the noise superposed based on the feature vector 
of the noise segment having been stored, and selecting an acoustic model 
from the acoustic models corresponding to each of the noise types based 
on the result of the determination; 

the step of eliminating the noise by the cepstrum mean 
normalization method from the speech segment on which the noise has 
been superposed using the feature vector of the speech segment having
been stored, and 

the step of performing a speech recognition on the feature vector 
after the noise elimination based on the selected acoustic model. 

[Claim 15] A storage medium storing a speech recognition program 
according to one of Claims 12 to 14, wherein the acoustic models 
correspond to not only types of noise but also a plurality of S/N ratios 
for each of the noise types, and the acoustic models corresponding to 
the plurality of S/N ratios for each of the noise types are created by 
generating speech data on which noises with the plurality of S/N ratios 
for each of the noise types have been respectively superposed, 
eliminating the noises from each of the speech data by a predetermined 
noise elimination method, and using the feature vectors of each of the 
speech data which have undergone the noise elimination. 

[Claim 16] A storage medium storing a speech recognition program 
according to Claim 15, wherein when the acoustic models corresponding to 
the plurality of S/N ratios for each of the noise types are created, in
addition to determining the type of the noise superposed on the speech
data to be recognized, the S/N ratio is obtained from the magnitude of 
the noise in the noise segment and the magnitude of the speech in the 
speech segment, and an acoustic model is selected based on the noise 
type determined and the S/N ratio obtained. 

[Claim 17] A storage medium storing a speech recognition program 
comprising:

the step of creating speech data on which different types of noise 
have been superposed respectively, eliminating the noise by a spectral
subtraction method or a continuous spectral subtraction method from
each of the speech data on which the different types of noise have been
superposed, applying a cepstrum mean normalization method to each of the 
speech data which have undergone the noise elimination to obtain the 
feature vectors of a speech segment, and creating acoustic models 
corresponding to each of the noise types based on the feature vectors 
and storing the acoustic models in acoustic model storage means; 

the step of performing a first speech feature analysis to obtain 
frequency-domain feature data of speech data to be recognized on which a 
noise has been superposed; 

the step of determining whether the speech data is a noise segment 
or a speech segment based on the result of the feature analysis, and 
storing the feature vector thereof when a noise segment is detected; 

the step of eliminating the noise from the speech segment by the 
spectral subtraction method or the continuous spectral subtraction 
method when a speech segment is detected;

the step of performing a second speech feature analysis on the
speech segment data which has undergone the noise elimination to obtain 
cepstrum coefficients, and storing the feature vector of the speech 
segment; 

the step of, when the speech segment has terminated, determining 
the type of the noise superposed based on the feature data of the noise 
segment having been stored, and selecting an acoustic model from the 
acoustic models corresponding to each of the noise types; 

the step of applying the cepstrum mean normalization method to the
feature vector of the speech segment on which the noise has been 
superposed, using the feature vector of the speech segment having been 
stored, to obtain the feature vector of the speech segment; and 

the step of performing a speech recognition on the feature vector 
obtained by the cepstrum mean normalization method based on the selected 
acoustic model. 

[Claim 18] A storage medium storing a speech recognition program 
according to Claim 17, wherein the acoustic models correspond to not 
only types of noise but also a plurality of S/N ratios for each of the 
noise types, and the acoustic models corresponding to the plurality of 
S/N ratios for each of the noise types are created by generating speech 
data on which noises with the plurality of S/N ratios for each of the 
noise types have been respectively superposed, eliminating the noises 
from each of the speech data by the spectral subtraction method or the 
continuous spectral subtraction method, and using the feature vectors of 
each of the speech data obtained by applying the cepstrum mean
normalization method to each of the speech data which have undergone the
noise elimination. 

[Claim 19] A storage medium storing a speech recognition program 
according to Claim 17, wherein when the acoustic models corresponding to 
the plurality of S/N ratios for each of the noise types are created, in 
addition to determining the type of the noise superposed on the speech 
data to be recognized, the S/N ratio is obtained from the magnitude of 
the noise in the noise segment and the magnitude of the speech in the 
speech segment, and an acoustic model is selected based on the noise 
type determined and the S/N ratio obtained. 

[Claim 20] A storage medium storing a speech recognition program 
comprising:

the step of creating speech data on which a particular type of 
noise with different S/N ratios have been superposed respectively, 
eliminating the noise by a predetermined noise elimination method from 
each of the speech data, and creating acoustic models corresponding to 
each of the S/N ratios using the feature vectors of each of the speech
data which have undergone the noise elimination and storing the acoustic 
models in acoustic model storage means; 

the step of determining the S/N ratio of a noise superposed on 
speech data to be recognized, and selecting a corresponding acoustic 
model from the acoustic models corresponding to each of the S/N ratios 
based on the result of the determination; 

the step of eliminating the noise by the predetermined noise 
elimination method from the speech data to be recognized on which the
noise has been superposed; and

the step of performing a speech recognition on the feature vector 
of the speech data which has undergone the noise elimination based on 
the selected acoustic model. 

[Claim 21] A storage medium storing a speech recognition program 
according to Claim 20, wherein the noise elimination method is a 
spectral subtraction method or a continuous spectral subtraction method. 

[Claim 22] A storage medium storing a speech recognition program 
according to Claim 20, wherein the noise elimination method is a
cepstrum mean normalization method. 

[Claim 23] A speech recognition apparatus comprising: 

acoustic models corresponding to each of different types of noise, 
created by generating speech data on which the different types of noise 
have been superposed respectively, eliminating the noise by a 
predetermined noise elimination method from each of the speech data on 
which the different types of noise have been superposed, and using the 
feature vectors of each of the speech data which have undergone the 
noise elimination; 

acoustic model storage means for storing the acoustic models; 

noise determination means for determining the type of a noise 
superposed on speech data to be recognized; 

acoustic model selection means for selecting a corresponding 
acoustic model from the acoustic models corresponding to each of the 
noise types based on the result of the determination; 

noise elimination means for eliminating the noise by the
predetermined noise elimination method from the speech data to be
recognized on which the noise has been superposed; and 

speech recognition means for performing a speech recognition on the 
feature vector of the speech data which has undergone the noise 
elimination based on the selected acoustic model. 

[Claim 24] A speech recognition apparatus according to Claim 23, 
wherein the noise elimination method is a spectral subtraction method or 
a continuous spectral subtraction method, and the acoustic models are 
created by eliminating the noise by the spectral subtraction method or
the continuous spectral subtraction method from each of the speech data 
on which the different types of noise have been superposed, obtaining 
the feature vectors of each of the speech data which have undergone the 
noise elimination, and using the feature vectors; 

and the speech recognition apparatus comprises: 

acoustic model storage means for storing the acoustic models thus 
created; 

first speech feature analysis means for performing a first speech 
feature analysis to obtain frequency-domain feature data of the speech 
data on which the noise has been superposed; 

noise segment/speech segment determination means for determining
whether the speech data is a noise segment or a speech segment based on
the result of the feature analysis, and when a noise segment is
detected, storing the feature data thereof in feature data storage
means;

noise type determination means for determining the type of the
noise superposed based on the feature data having been stored when a
speech segment is detected;

acoustic model selection means for selecting a corresponding 
acoustic model from the acoustic models corresponding to each of the 
noise types based on the result of the determination; 

noise elimination means for eliminating the noise by the spectral 
subtraction method or the continuous spectral subtraction method from 
the speech data to be recognized on which the noise has been superposed; 

second speech feature analysis means for performing a second
feature analysis on the speech data which has undergone the noise
elimination to obtain feature data required in the speech recognition; 
and 

speech recognition means for performing a speech recognition on the 
result of the feature analysis based on the selected acoustic model. 

[Claim 25] A speech recognition apparatus according to Claim 23, 
wherein the noise elimination method is a cepstrum mean normalization 
method, and the acoustic models are created by eliminating the noise by 
the cepstrum mean normalization method from each of the speech data on 
which the different types of noise have been superposed and using the 
feature vectors of the speech data obtained thereby; 

and the speech recognition apparatus comprises: 

acoustic model storage means for storing the acoustic models thus 
created; 

feature analysis means for performing a first speech feature 
analysis on the speech data to be recognized on which the noise has been
superposed to obtain a feature vector representing cepstrum
coefficients; 

noise segment/speech segment determination means for determining 
whether the speech data is a noise segment or a speech segment based on 
the result of the feature analysis, and storing the feature vector 
thereof in feature data storage means when a noise segment is detected,
whereas when a speech segment is detected, storing the feature data of
the speech segment from the beginning through the end thereof in the
feature data storage means;

noise type determination means for determining the type of the 
noise superposed based on the feature vector of the noise segment having 
been stored in the feature data storage means; 

acoustic model selection means for selecting a corresponding 
acoustic model from the acoustic models corresponding to each of the 
noise types based on the result of the determination; 

noise elimination means for eliminating the noise by the cepstrum 
mean normalization method from the speech segment on which the noise has 
been superposed using the feature vector of the speech segment having 
been stored; and 

speech recognition means for performing a speech recognition on the 
feature vector after the noise elimination based on the selected 
acoustic model. 

[Claim 26] A speech recognition apparatus according to one of Claims 
23 to 25, wherein the acoustic models correspond to not only types of 
noise but also a plurality of S/N ratios for each of the noise types,
and the acoustic models corresponding to the plurality of S/N ratios for
each of the noise types are created by generating speech data on which 
noises with the plurality of S/N ratios for each of the noise types have 
been respectively superposed, eliminating the noises from each of the 
speech data by a predetermined noise elimination method, and using the 
feature vectors of each of the speech data which have undergone the 
noise elimination. 

[Claim 27] A speech recognition apparatus according to Claim 26, 
wherein when the acoustic models corresponding to the plurality of S/N
ratios for each of the noise types are created, in addition to 
determining the type of the noise superposed on the speech data to be 
recognized, the noise type determination means obtains the S/N ratio 
from the magnitude of the noise in the noise segment and the magnitude 
of the speech in the speech segment, and the acoustic model selection 
means selects an acoustic model based on the noise type determined and 
the S/N ratio obtained. 

[Claim 28] A speech recognition apparatus comprising: 

acoustic models corresponding to each of different types of noise, 
created by generating speech data on which the different types of noise 
have been superposed respectively, eliminating the noise by a spectral 
subtraction method or a continuous spectral subtraction method from each 
of the speech data on which the different types of noise have been 
superposed, applying a cepstrum mean normalization method to each of the 
speech data which have undergone the noise elimination to obtain the 
feature vectors of a speech segment, and using the feature vectors; 




acoustic model storage means for storing the acoustic models; 

first speech feature analysis means for performing a first speech 
feature analysis to obtain frequency-domain feature data of speech data 
to be recognized; 

noise segment/speech segment determination means for determining 
whether the speech data is a noise segment or a speech segment based on 
the result of the feature analysis, and storing the feature vector 
thereof in feature data storage means when a noise segment is detected; 

noise elimination means for eliminating the noise from the speech
segment by the spectral subtraction method or the continuous spectral 
subtraction method when a speech segment is detected; 

second speech feature analysis means for performing a second speech 
feature analysis on the speech segment data which has undergone the 
noise elimination to obtain cepstrum coefficients, and storing the 
feature vector of the speech segment in the feature data storage means; 

noise type determination means for determining, when the speech 
segment has terminated, the type of the noise superposed based on the 
feature data of the noise segment having been stored; 

acoustic model selection means for selecting a corresponding 
acoustic model from the acoustic models corresponding to each of the 
noise types; 

cepstrum mean normalization operation means for applying the 
cepstrum mean normalization method to the feature vector of the speech 
segment on which the noise has been superposed, using the feature vector 
of the speech segment having been stored, to output the feature vector
of the speech segment; and

speech recognition means for performing a speech recognition on the 
feature vector based on the selected acoustic model. 

[Claim 29] A speech recognition apparatus according to Claim 28, 
wherein the acoustic models correspond to not only types of noise but 
also a plurality of S/N ratios for each of the noise types, and the 
acoustic models corresponding to the plurality of S/N ratios for each of 
the noise types are created by generating speech data on which noises 
with the plurality of S/N ratios for each of the noise types have been
respectively superposed, eliminating the noises from each of the speech 
data by the spectral subtraction method or the continuous spectral 
subtraction method, and using the feature vectors of each of the speech 
data obtained by applying the cepstrum mean normalization method to each 
of the speech data which have undergone the noise elimination. 

[Claim 30] A speech recognition apparatus according to Claim 29, 
wherein when the acoustic models corresponding to the plurality of S/N 
ratios for each of the noise types are created, in addition to 
determining the type of the noise superposed on the speech data to be 
recognized, the noise type determination means obtains the S/N ratio 
from the magnitude of the noise in the noise segment and the magnitude 
of the speech in the speech segment, and the acoustic model selection 
means selects an acoustic model based on the noise type determined and 
the S/N ratio obtained. 

[Claim 31] A speech recognition apparatus comprising: 

acoustic models corresponding to each of different S/N ratios for a
particular type of noise, created by generating speech data on which the
particular type of noise with the different S/N ratios have been 
superposed respectively, eliminating the noise by a predetermined noise 
elimination method from each of the speech data, and using the feature 
vectors of each of the speech data which have undergone the noise 
elimination; 

acoustic models storage means for storing the acoustic models; 

S/N ratio determination means for determining the S/N ratio of a 
noise superposed on speech data to be recognized;

acoustic model selection means for selecting a corresponding 
acoustic model from the acoustic models corresponding to each of the S/N 
ratios based on the result of the determination; 

noise elimination means for eliminating the noise by the 
predetermined noise elimination method from the speech data to be 
recognized on which the noise has been superposed; and 

speech recognition means for performing a speech recognition on the 
feature vector of the speech data which has undergone the noise 
elimination based on the selected acoustic model. 

[Claim 32] A speech recognition apparatus according to Claim 31, 
wherein the noise elimination method is a spectral subtraction method or 
a continuous spectral subtraction method. 

[Claim 33] A speech recognition apparatus according to Claim 31, 
wherein the noise elimination method is a cepstrum mean normalization 
method. 

[Detailed Description of the Invention]
[Technical Field of the Invention]

The present invention relates to a speech recognition method, a 
storage medium storing a speech recognition program, and a speech 
recognition apparatus which serve to achieve a high accuracy of
recognition even under an environment where various background noises 
are present. 

[Description of the Related Art] 

In recent years, devices incorporating speech recognition
functionality have come into wide use. The devices are used under various
environments, and in many cases under noisy environments.

In such cases, obviously, countermeasures must be taken against
noise. Examples of noise include stationary noises such
as the sound of an automobile and the sound of an air conditioner.
The speech recognition methods described below have hitherto been used for
speech recognition under an environment where such stationary noises are
present.

As a first example, speech recognition may be performed by 
superposing noise data obtained from the stationary noises described 
above on speech data taken under a noise-free environment, creating an 
acoustic model by learning the speech data, and using the acoustic
model.

As a second example, speech recognition may be performed, for
example, by the spectral subtraction method. According to this speech
recognition method, the noise component is eliminated from input speech
data, and a speech recognition is performed on the speech data which has
undergone the noise elimination. In this case as well, similarly to the
above, noise data obtained from the stationary noises are superposed on
speech data taken under a noise-free environment, the noise is
eliminated by the spectral subtraction method from the speech data, an
acoustic model is created by learning the speech data which has
undergone the noise elimination, and a speech recognition is performed
based on the acoustic model.
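
As an illustration only, the subtraction step of the second example might be sketched as follows, assuming that a magnitude spectrum is already available for each frame and that an average noise magnitude spectrum has been estimated from noise-only frames. The function names, the over-subtraction factor `alpha`, and the spectral floor `floor` are hypothetical and are not taken from this specification.

```python
from typing import List, Sequence


def average_spectrum(frames: Sequence[Sequence[float]]) -> List[float]:
    """Average magnitude spectrum over noise-only frames (hypothetical helper)."""
    if not frames:
        raise ValueError("at least one noise frame is required")
    n_bins = len(frames[0])
    return [sum(f[k] for f in frames) / len(frames) for k in range(n_bins)]


def spectral_subtract(frame: Sequence[float], noise: Sequence[float],
                      alpha: float = 1.0, floor: float = 0.01) -> List[float]:
    """Subtract the noise estimate from one frame, bin by bin.

    alpha and floor are assumed tuning parameters. The floor keeps each bin
    from becoming negative after the subtraction.
    """
    if len(frame) != len(noise):
        raise ValueError("frame and noise must have the same number of bins")
    return [max(x - alpha * n, floor * x) for x, n in zip(frame, noise)]


# Usage: estimate the noise from two noise-only frames, then clean one frame.
noise_estimate = average_spectrum([[1.0, 2.0], [3.0, 4.0]])
cleaned = spectral_subtract([5.0, 3.0], noise_estimate)
```

In this sketch the same routine would be applied both to the training data on which noise has been superposed and to the input speech data, so that the acoustic model and the input are processed in the same way.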

[Problems to be Solved by the Invention] 

Use of the speech recognition methods described above improves the
accuracy of recognition under a noisy environment to a certain extent
compared with the case where no countermeasure is taken. However,
problems still remain.

More specifically, the stationary noises include various types such
as the bustle of the city as well as the sound of an automobile and the
sound of an air conditioner mentioned above, each having different
characteristics.

The acoustic models in the examples above are typically created by
learning only a particular type of noise. For example, the sound of an
automobile is used as noise data, the noise data is
superposed on speech data, the noise is eliminated from the speech data
by the spectral subtraction method, and an acoustic model for speech
recognition is created by learning the speech data which has undergone
the noise elimination.

If speech recognition is performed based on an acoustic model
created for a particular type of noise, a relatively satisfactory result
can be obtained under an environment where that type of noise is present.
However, other types of noise may be present under different
environments, in which case the recognition rate obviously falls.

Furthermore, in addition to the type of noise, the S/N ratio, which
represents the ratio of a speech signal to be recognized to a noise
signal, affects the accuracy of recognition.

Accordingly, an object of the present invention is to achieve a
high accuracy of recognition in accordance with the type and S/N ratio
of a noise, while allowing implementation in inexpensive hardware using a
CPU with a relatively low operation capability.

[Means for Solving the Problems]

To this end, according to a speech recognition method of the 
present invention, speech data on which different types of noise are 
superposed respectively are created, the noise is eliminated by a 
predetermined noise elimination method from each of the speech data on 
which the noise is superposed, and acoustic models corresponding to each 
of the noise types are created and stored using the feature vectors of 
each of the speech data which have undergone the noise elimination; and 
when a speech recognition is performed, the type of a noise superposed 
on speech data to be recognized is determined, a corresponding acoustic 
model is selected from the acoustic models corresponding to each of the 
noise types based on the result of the determination, the noise is 
eliminated by the predetermined noise elimination method from the speech 
data to be recognized on which the noise is superposed, and a speech 
recognition is performed on the feature vector of the speech data which
has undergone the noise elimination based on the selected acoustic
model.

According to a storage medium storing a speech recognition program 
of the present invention, the speech recognition program comprises the 
step of creating speech data on which different types of noise are 
superposed respectively, eliminating the noise by a predetermined noise 
elimination method from each of the speech data on which the noise is 
superposed, and creating acoustic models corresponding to each of the 
noise types using the feature vectors obtained by analyzing the features
of each of the speech data which have undergone the noise elimination, 
and storing the acoustic models in acoustic model storage means; the 
step of determining the type of a noise superposed on speech data to be 
recognized, and selecting a corresponding acoustic model from the 
acoustic models stored in said acoustic model storage means based on the 
result of the determination; the step of eliminating the noise by the 
predetermined noise elimination method from the speech data to be 
recognized on which the noise is superposed; and the step of performing 
a speech recognition on the feature vector of the speech data which has 
undergone the noise elimination based on the selected acoustic model. 

In each of the inventions, the noise elimination method may be the 
spectral subtraction method or the continuous spectral subtraction 
method, and the acoustic models are created by eliminating the noise by 
the spectral subtraction method or the continuous spectral subtraction 
method from each of the speech data on which the different types of
noise are superposed, obtaining the feature vectors of each of the
speech data which have undergone the noise elimination, and using the
feature vectors. When a speech recognition is performed, a first speech 
feature analysis is performed to obtain the frequency-domain feature 
data of the speech data on which the noise is superposed; it is 
determined whether the speech data is a noise segment or a speech 
segment based on the result of the feature analysis, and when a noise 
segment is detected, the feature data thereof is stored, whereas when a 
speech segment is detected, the type of the noise superposed is 
determined based on the feature data having been stored and a
corresponding acoustic model is selected from the acoustic models 
corresponding to each of the noise types based on the result of the 
determination; the noise is eliminated by the spectral subtraction 
method or the continuous spectral subtraction method from the speech 
data to be recognized on which the noise is superposed; and a second 
feature analysis is performed on the speech data which has undergone the 
noise elimination to obtain feature data required in the speech 
recognition, and a speech recognition is performed on the result of the
feature analysis based on the selected acoustic model. 
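
This passage does not specify how the noise type is determined from the stored feature data of the noise segment. One possible sketch, assuming that each noise type is represented by a reference feature vector and that the reference closest in Euclidean distance to the averaged noise-segment features is chosen, is given below. The names `determine_noise_type` and `references`, the noise type labels, and the choice of distance measure are all hypothetical.

```python
import math
from typing import Dict, List, Sequence


def mean_vector(frames: Sequence[Sequence[float]]) -> List[float]:
    """Average the stored noise-segment feature vectors, element by element."""
    if not frames:
        raise ValueError("no noise-segment features have been stored")
    dim = len(frames[0])
    return [sum(f[k] for f in frames) / len(frames) for k in range(dim)]


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def determine_noise_type(noise_frames: Sequence[Sequence[float]],
                         references: Dict[str, Sequence[float]]) -> str:
    """Return the label of the reference vector nearest to the stored noise.

    Nearest-reference matching is an assumption made for illustration.
    """
    avg = mean_vector(noise_frames)
    return min(references, key=lambda name: euclidean(avg, references[name]))


# Usage with made-up two-dimensional features.
references = {"automobile": [1.0, 0.0], "air_conditioner": [0.0, 1.0]}
label = determine_noise_type([[0.9, 0.1], [0.7, 0.3]], references)
```

The label returned here would then serve as the key for selecting the corresponding acoustic model.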

Alternatively, the noise elimination method may be the cepstrum mean
normalization method, and the acoustic models are created by eliminating 
the noise by the cepstrum mean normalization method from each of the 
speech data on which the different types of noise are superposed and 
using the feature vectors of the speech data obtained thereby. When a 
speech recognition is performed, a first speech feature analysis is 
performed on the speech data to be recognized on which the noise is
superposed to obtain a feature vector representing cepstrum
coefficients; it is determined whether the speech data is a noise
segment or a speech segment based on the result of the feature analysis, 
and when a noise segment is detected, the feature vector thereof is 
stored, and when a speech segment is detected, the feature data of the 
speech segment from the beginning through the end thereof is stored, the 
type of the noise superposed is determined based on the feature vector 
of the noise segment having been stored, and an acoustic model is 
selected from the acoustic models corresponding to each of the noise
types based on the result of the determination; the noise is eliminated 
by the cepstrum mean normalization method from the speech segment on 
which the noise is superposed using the feature vector of the speech 
segment having been stored, and a speech recognition is performed on the 
feature vector after the noise elimination based on the selected 
acoustic model. 
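
The normalization step described above can be sketched as follows, assuming the commonly used form of cepstrum mean normalization in which the mean cepstrum vector of the stored speech segment is subtracted from the cepstrum vector of every frame in that segment. The function name and the list-of-lists representation of the feature vectors are hypothetical.

```python
from typing import List, Sequence


def cepstral_mean_normalize(
        frames: Sequence[Sequence[float]]) -> List[List[float]]:
    """Subtract the per-coefficient mean over the whole speech segment.

    frames holds one cepstrum vector per frame, from the beginning through
    the end of the speech segment.
    """
    if not frames:
        return []
    dim = len(frames[0])
    mean = [sum(f[k] for f in frames) / len(frames) for k in range(dim)]
    return [[f[k] - mean[k] for k in range(dim)] for f in frames]


# Usage with two made-up frames of two coefficients each.
normalized = cepstral_mean_normalize([[1.0, 4.0], [3.0, 8.0]])
```

Because the mean is taken over the entire segment, the normalization in this sketch can only be applied once the segment has ended, which is consistent with storing the feature data from the beginning through the end of the speech segment.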

Furthermore, the acoustic models may be created corresponding to a 
plurality of S/N ratios for each of the noise types, and the acoustic 
models corresponding to the plurality of S/N ratios for each of the 
noise types are created by generating speech data on which noises with 
the plurality of S/N ratios for each of the noise types are respectively 
superposed, eliminating the noises from each of the speech data by a 
predetermined noise elimination method, and using the feature vectors of 
each of the speech data which have undergone the noise elimination. 

When the acoustic models are created corresponding to the plurality 
of S/N ratios for each of the noise types, in addition to determining
the type of the noise superposed on the speech data to be recognized,
the S/N ratio may be obtained from the magnitude of the noise in the 
noise segment and the magnitude of the speech in the speech segment, and 
an acoustic model is selected based on the noise type determined and the 
S/N ratio obtained. 
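
A minimal sketch of this selection follows, assuming that the magnitudes are expressed as mean power values, that the S/N ratio is expressed in decibels, and that the stored model whose S/N ratio is nearest to the estimate is chosen for the determined noise type. The function names, the dictionary layout, and the nearest-value rule are hypothetical.

```python
import math
from typing import Dict, Tuple, TypeVar

Model = TypeVar("Model")


def estimate_snr_db(speech_power: float, noise_power: float) -> float:
    """S/N ratio in dB from the speech-segment and noise-segment power."""
    if speech_power <= 0.0 or noise_power <= 0.0:
        raise ValueError("power values must be positive")
    return 10.0 * math.log10(speech_power / noise_power)


def select_model(models: Dict[Tuple[str, float], Model],
                 noise_type: str, snr_db: float) -> Model:
    """Pick the model for noise_type whose S/N ratio is nearest to snr_db.

    models is keyed by (noise type, S/N ratio in dB); the keys are assumed.
    """
    candidates = [key for key in models if key[0] == noise_type]
    if not candidates:
        raise KeyError(noise_type)
    best = min(candidates, key=lambda key: abs(key[1] - snr_db))
    return models[best]


# Usage with placeholder strings standing in for acoustic models.
models = {
    ("automobile", 0.0): "car_0dB",
    ("automobile", 10.0): "car_10dB",
    ("air_conditioner", 10.0): "ac_10dB",
}
chosen = select_model(models, "automobile", estimate_snr_db(100.0, 10.0))
```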

According to another speech recognition method, speech data on 
which different types of noise are superposed respectively are created, 
the noise is eliminated by the spectral subtraction method or the 
continuous spectral subtraction method from each of the speech data on 
which the different types of noise are superposed, the cepstrum mean 
normalization method is applied to each of the speech data which have 
undergone the noise elimination to obtain the feature vectors of a 
speech segment, and acoustic models corresponding to each of the noise 
types are created and stored based on the feature vectors. When a 
speech recognition is performed, a first speech feature analysis is 
performed to obtain the frequency-domain feature data of speech data to 
be recognized; it is determined whether the speech data is a noise 
segment or a speech segment based on the result of the feature analysis, 
and when a noise segment is detected, the feature vector thereof is 
stored; and when a speech segment is detected, the noise is eliminated 
from the speech segment by the spectral subtraction method or the 
continuous spectral subtraction method; a second speech feature analysis 
is performed on the speech segment data which has undergone the noise 
elimination to obtain cepstrum coefficients, and the feature vector of 
the speech segment is stored; when the speech segment has terminated,
the type of the noise superposed is determined based on the feature data
of the noise segment having been stored, and an acoustic model is 
selected from the acoustic models corresponding to each of the noise 
types; the cepstrum mean normalization method is applied to the feature 
vector of the speech segment on which the noise is superposed, using the 
feature vector of the speech segment having been stored, to obtain the 
feature vector of the speech segment; and a speech recognition is 
performed on the feature vector obtained by the cepstrum mean 
normalization method based on the selected acoustic model. 

According to another storage medium storing a speech recognition 
program of the present invention, the speech recognition program 
comprises the step of creating speech data on which different types of 
noise are superposed respectively, eliminating the noise by the spectral 
subtraction method or the continuous spectral subtraction method from 
each of the speech data on which the different types of noise are 
superposed, applying the cepstrum mean normalization method to each of 
the speech data which have undergone the noise elimination to obtain the
feature vectors of a speech segment, and creating acoustic models 
corresponding to each of the noise types based on the feature vectors 
and storing the acoustic models in acoustic model storage means; the 
step of performing a first speech feature analysis to obtain the 
frequency-domain feature data of speech data to be recognized on which a 
noise is superposed; the step of determining whether the speech data is
a noise segment or a speech segment based on the result of the feature 
analysis, and storing the feature vector thereof when a noise segment is
detected; the step of eliminating the noise from the speech segment by 
the spectral subtraction method or the continuous spectral subtraction 
method when a speech segment is detected; the step of performing a 
second speech feature analysis on the speech segment data which has 
undergone the noise elimination to obtain cepstrum coefficients, and 
storing the feature vector of the speech segment; the step of, when the 
speech segment has terminated, determining the type of the noise 
superposed based on the feature data of the noise segment having been 
stored, and selecting an acoustic model from the acoustic models
corresponding to each of the noise types; the step of applying the 
cepstrum mean normalization method to the feature vector of the speech 
segment on which the noise is superposed, using the feature vector of 
the speech segment having been stored, to obtain the feature vector of 
the speech segment; and the step of performing a speech recognition on 
the feature vector obtained by the cepstrum mean normalization method 
based on the selected acoustic model. 

In the speech recognition method and the storage medium storing the 
speech recognition program, the acoustic models may be created 
corresponding to a plurality of S/N ratios for each of the noise types, 
and the acoustic models corresponding to the plurality of S/N ratios for 
each of the noise types are created by generating speech data on which 
noises with the plurality of S/N ratios for each of the noise types are 
respectively superposed, eliminating the noises from each of the speech 
data by the spectral subtraction method or the continuous spectral 
subtraction method, and using the feature vectors of each of the speech
data obtained by applying the cepstrum mean normalization method to each
of the speech data which have undergone the noise elimination. 

When the acoustic models are created corresponding to the plurality
of S/N ratios for each of the noise types, in addition to determining 
the type of the noise superposed on the speech data to be recognized, 
the S/N ratio may be obtained from the magnitude of the noise in the 
noise segment and the magnitude of the speech in the speech segment, and 
an acoustic model is selected based on the noise type determined and the 
S/N ratio obtained. 

According to a speech recognition method of the present invention, 
speech data on which a particular type of noise with different S/N 
ratios are superposed respectively are created, the noise is eliminated 
by a predetermined noise elimination method from each of the speech 
data, and acoustic models corresponding to each of the S/N ratios are 
created and stored using the feature vectors of each of the speech data 
which have undergone the noise elimination. When a speech recognition 
is performed, the S/N ratio of a noise superposed on speech data to be 
recognized is determined, a corresponding acoustic model is selected 
from the acoustic models corresponding to each of the S/N ratios based 
on the result of the determination, the noise is eliminated by the 
predetermined noise elimination method from the speech data to be 
recognized on which the noise is superposed, and a speech recognition is 
performed on the feature vector of the speech data which has undergone 
the noise elimination based on the selected acoustic model. 

According to a storage medium storing a speech recognition program
of the present invention, the speech recognition program comprises the
step of creating speech data on which a particular type of noise with 
different S/N ratios are superposed respectively, eliminating the noise 
by a predetermined noise elimination method from each of the speech 
data, and creating acoustic models corresponding to each of the S/N 
ratios using the feature vectors of each of the speech data which have 
undergone the noise elimination and storing the acoustic models in 
acoustic model storage means; the step of determining the S/N ratio of 
noise superposed on speech data to be recognized, and selecting a 
corresponding acoustic model from the acoustic models corresponding to 
each of the S/N ratios based on the result of the determination; the 
step of eliminating the noise by the predetermined noise elimination 
method from the speech data to be recognized on which the noise is 
superposed; and the step of performing a speech recognition on the 
feature vector of the speech data which has undergone the noise 
elimination based on the selected acoustic model. 

In each of the inventions, the noise elimination method may be the
spectral subtraction method or the continuous spectral subtraction
method, or alternatively may be the cepstrum mean
normalization method.

A speech recognition apparatus of the present invention comprises 
acoustic models corresponding to each of different types of noise, 
created by generating speech data on which the different types of noise 
are superposed respectively, eliminating the noise by a predetermined 
noise elimination method from each of the speech data on which the
different types of noise are superposed, and using the feature vectors
of each of the speech data which have undergone the noise elimination; 
acoustic model storage means for storing the acoustic models; noise 
determination means for determining the type of a noise superposed on 
speech data to be recognized; acoustic model selection means for 
selecting a corresponding acoustic model from the acoustic models 
corresponding to each of the noise types based on the result of the 
determination; noise elimination means for eliminating the noise by the 
predetermined noise elimination method from the speech data to be 
recognized on which the noise is superposed; and speech recognition 
means for performing a speech recognition on the feature vector of the 
speech data which has undergone the noise elimination based on the 
selected acoustic model. 

In the speech recognition apparatus, the noise elimination method 
may be the spectral subtraction method or the continuous spectral 
subtraction method, and the acoustic models are created by eliminating 
the noise by the spectral subtraction method or the continuous spectral 
subtraction method from each of the speech data on which the different 
types of noise are superposed, obtaining the feature vectors of each of 
the speech data which have undergone the noise elimination, and using 
the feature vectors. The speech recognition apparatus may comprise 
acoustic model storage means for storing the acoustic models thus 
created; first speech feature analysis means for performing a first 
speech feature analysis to obtain the frequency-domain feature data of 
the speech data on which the noise is superposed; noise segment/speech
segment determination means for determining whether the speech data is a
noise segment or a speech segment based on the result of the feature 
analysis, and when a noise segment is detected, storing the feature data 
thereof in feature data storage means; noise type determination means 
for determining the type of the noise superposed based
on the feature data having been stored when a speech segment is 
detected; acoustic model selection means for selecting a corresponding 
acoustic model from the acoustic models corresponding to each of the 
noise types based on the result of the determination; noise elimination
means for eliminating the noise by the spectral subtraction method or 
the continuous spectral subtraction method from the speech data to be 
recognized on which the noise is superposed; second speech feature 
analysis means for performing a second feature analysis on the speech 
data which has undergone the noise elimination to obtain feature data 
required in the speech recognition; and speech recognition means for 
performing a speech recognition on the result of the feature analysis 
based on the selected acoustic model. 

The noise elimination method may be the cepstrum mean normalization 
method, and the acoustic models are created by eliminating the noise by 
the cepstrum mean normalization method from each of the speech data on 
which the different types of noise are superposed and using the feature 
vectors of the speech data obtained thereby. The speech recognition 
apparatus may comprise acoustic model storage means for storing the 
acoustic models thus created; feature analysis means for performing a 
first speech feature analysis on the speech data to be recognized on
which the noise is superposed to obtain a feature vector representing
cepstrum coefficients; noise segment/speech segment determination means
for determining whether the speech data is a noise segment or a speech 
segment based on the result of the feature analysis, and storing the 
feature vector thereof in feature data storage means when a noise 
segment is detected, whereas when a speech segment is detected, storing
the feature data of the speech segment from the beginning through the 
end thereof in the feature data storage means; noise type determination 
means for determining the type of the noise superposed based on the 
feature vector of the noise segment having been stored in the feature 
data storage means; acoustic model selection means for selecting a 
corresponding acoustic model from the acoustic models corresponding to 
each of the noise types based on the result of the determination; noise 
elimination means for eliminating the noise by the cepstrum mean 
normalization method from the speech segment on which the noise is 
superposed using the feature vector of the speech segment having been 
stored; and speech recognition means for performing a speech recognition
on the feature vector after the noise elimination based on the selected 
acoustic model. 

The acoustic models may be created corresponding to a plurality of 
S/N ratios for each of the noise types, and the acoustic models 
corresponding to the plurality of S/N ratios for each of the noise types 
are created by generating speech data on which noises with the plurality 
of S/N ratios for each of the noise types are respectively superposed, 
eliminating the noises from each of the speech data by a predetermined
noise elimination method, and using the feature vectors of each of the
speech data which have undergone the noise elimination. 

When the acoustic models are created corresponding to the plurality 
of S/N ratios for each of the noise types, in addition to determining 
the type of the noise superposed on the speech data to be recognized, 
the noise type determination means may obtain the S/N ratio from the 
magnitude of the noise in the noise segment and the magnitude of the 
speech in the speech segment, and the acoustic model selection means may 
select an acoustic model based on the noise type determined and the S/N
ratio obtained. 

Another speech recognition apparatus of the present invention 
comprises acoustic models corresponding to each of different types of 
noise, created by generating speech data on which the different types of 
noise are superposed respectively, eliminating the noise by the spectral 
subtraction method or the continuous spectral subtraction method from 
each of the speech data on which the different types of noise are 
superposed, applying the cepstrum mean normalization method to each of 
the speech data which have undergone the noise elimination to obtain the 
feature vectors of a speech segment, and using the feature vectors; 
acoustic model storage means for storing the acoustic models; first 
speech feature analysis means for performing a first speech feature 
analysis to obtain the frequency-domain feature data of speech data to 
be recognized; noise segment/speech segment determination means for 
determining whether the speech data is a noise segment or a speech 
segment based on the result of the feature analysis, and storing the
feature vector thereof in feature data storage means when a noise
segment is detected; noise elimination means for eliminating the noise 
from the speech segment by the spectral subtraction method or the 
continuous spectral subtraction method when a speech segment is 
detected; second speech feature analysis means for performing a second 
speech feature analysis on the speech segment data which has undergone 
the noise elimination to obtain cepstrum coefficients, and storing the 
feature vector of the speech segment in the feature data storage means; 
noise type determination means for determining, when the speech segment 
has terminated, the type of the noise superposed based on the feature 
data of the noise segment having been stored; acoustic model selection 
means for selecting a corresponding acoustic model from the acoustic 
models corresponding to each of the noise types; cepstrum mean 
normalization operation means for applying the cepstrum mean 
normalization method to the feature vector of the speech segment on 
which the noise is superposed, using the feature vector of the speech 
segment having been stored, to output the feature vector of the speech 
segment; and speech recognition means for performing a speech 
recognition on the feature vector based on the selected acoustic model. 

In the speech recognition apparatus, the acoustic models may be 
created corresponding to a plurality of S/N ratios for each of the noise 
types, and the acoustic models corresponding to the plurality of S/N 
ratios for each of the noise types are created by generating speech data 
on which noises with the plurality of S/N ratios for each of the noise 
types are respectively superposed, eliminating the noises from each of
the speech data by the spectral subtraction method or the continuous 
spectral subtraction method, and using the feature vectors of each of 
the speech data obtained by applying the cepstrum mean normalization 
method to each of the speech data which have undergone the noise 
elimination. 

When the acoustic models are created corresponding to the plurality
of S/N ratios for each of the noise types, in addition to determining 
the type of the noise superposed on the speech data to be recognized, 
the noise type determination means may obtain the S/N ratio from the 
magnitude of the noise in the noise segment and the magnitude of the 
speech in the speech segment, and the acoustic model selection means may 
select an acoustic model based on the noise type determined and the S/N 
ratio obtained. 

A speech recognition apparatus of the present invention comprises 
acoustic models corresponding to each of different S/N ratios for a 
particular type of noise, created by generating speech data on which the 
particular type of noise with the different S/N ratios are superposed 
respectively, eliminating the noise by a predetermined noise elimination 
method from each of the speech data, and using the feature vectors of 
each of the speech data which have undergone the noise elimination; 
acoustic models storage means for storing the acoustic models; S/N ratio 
determination means for determining the S/N ratio of a noise superposed 
on speech data to be recognized; acoustic model selection means for 
selecting a corresponding acoustic model from the acoustic models 
corresponding to each of the S/N ratios based on the result of the 
determination; noise elimination means for eliminating the noise by the
predetermined noise elimination method from the speech data to be 
recognized on which the noise is superposed; and speech recognition 
means for performing a speech recognition on the feature vector of the 
speech data which has undergone the noise elimination based on the 
selected acoustic model. 

The noise elimination method may be the spectral subtraction method
or the continuous spectral subtraction method. Alternatively, the noise
elimination method may be the cepstrum mean normalization method.

As described above, according to the present invention, speech data 
on which different types of noise are superposed respectively are 
created, and the noise is eliminated from each of the speech data on 
which the noise is superposed, and acoustic models corresponding to each 
of the noise types are created using the speech data which have 
undergone the noise elimination. When a speech recognition is actually 
performed, the type of a noise superposed on speech data to be 
recognized is determined, and an acoustic model is selected from the 
acoustic models corresponding to the noise types based on the result of 
the determination, the noise is eliminated from the speech data to be 
recognized on which the noise is superposed by the predetermined noise 
elimination method, and a speech recognition is performed on the speech 
data which has undergone the noise elimination based on the selected 
acoustic model. 

Accordingly, the speech recognition is performed based on the most
suitable acoustic model in accordance with the type of the noise
superposed, achieving a high recognition rate even in a noisy
environment.

In particular, if a device is used under an environment where two 
or three types of stationary noise are present, a high recognition rate 
can be achieved by creating acoustic models for each of the noise types 
and performing a speech recognition as described above based on the 
acoustic models.

One noise elimination method which may be employed according
to the present invention is the spectral subtraction method or the
continuous spectral subtraction method, in which case the noise
elimination in the acoustic model creation process is performed by the 
spectral subtraction method or the continuous spectral subtraction 
method. When a speech recognition is actually performed, the type of a 
noise superposed is determined using the feature analysis data of the 
noise segment, a most suitable acoustic model is selected based on the 
result of the determination, the noise is eliminated by the spectral 
subtraction method or the continuous spectral subtraction method from 
the speech data to be recognized on which the noise is superposed, and a 
speech recognition is performed on the result of a feature analysis of 
the speech data which has undergone the noise elimination based on the 
selected acoustic model. 

By employing the spectral subtraction method or the continuous 
spectral subtraction method as described above, noise elimination can be 
executed with a relatively small amount of operations feasible for a CPU 
with a relatively low operation capability. Accordingly, implementation 
in small-scale inexpensive hardware is possible. Furthermore, because
the spectral subtraction method and the continuous spectral subtraction
method are believed to be effective in eliminating noise such as the
sound of an automobile, the sound of an air conditioner, and the bustle
of the city (generally referred to as additive noise), the invention is
highly advantageous when applied to devices typically used in an
environment with a considerable amount of such noise.

As another example of noise elimination method, the cepstrum mean 
normalization method may be employed. In that case, the noise 
elimination in the acoustic model creating process employs the cepstrum 
mean normalization method. When a speech recognition is actually 
performed, the type of a noise superposed is determined using the 
feature analysis data of the noise segment, a most suitable acoustic 
model is selected based on the result of the determination, the noise is 
eliminated from the speech data to be recognized on which the noise is 
superposed by the cepstrum mean normalization method, and a speech
recognition is performed on the feature vector obtained by the noise
elimination based on the selected acoustic model. 

By employing the cepstrum mean normalization method as the noise 
elimination method, noise elimination can be performed with a small 
amount of operations feasible for a CPU with relatively low operation 
capability. This allows implementation in small-scale inexpensive 
hardware. Furthermore, because the cepstrum mean normalization method 
is believed to be effective in eliminating noise such as distortions due 
to microphone characteristics and spatial transmission characteristics 
including echo (generally referred to as multiplicative noise), it is
highly advantageous when applied to devices typically used under an 
environment where such noise is likely to be present. 

Furthermore, in addition to the noise types, the acoustic models 
may be created for different S/N ratios for each of the noise types, so 
that when a speech recognition is actually performed, the S/N ratio of 
the noise superposed on the speech data to be recognized is obtained 
from the power of the noise segment and the power of the speech segment 
and an acoustic model in accordance with the S/N ratio and the noise 
type is selected, allowing recognition based on an acoustic model in 
accordance with the power of the noise as well as the noise type. 
Accordingly, a high recognition rate can be achieved when a speech
recognition is performed under environments where each of the noises
exists.

Furthermore, the acoustic models may be created using both the 
spectral subtraction method or the continuous spectral subtraction 
method and the cepstrum mean normalization method. In this case, when a
speech recognition is actually performed, noise is eliminated by the 
spectral subtraction method or the continuous spectral subtraction 
method, and the feature vector of the speech data which has undergone 
the noise elimination is generated by the cepstrum mean normalization 
method and the feature vector is supplied to the speech recognition unit 
for speech recognition, achieving a high accuracy of recognition, and in 
this case, allowing compatibility with a wide range of noise including 
the additive noise and the multiplicative noise described earlier. 

Furthermore, the present invention may be applied to speech
recognition involving a particular type of noise with a plurality of S/N 
ratios. In that case, speech data on which the particular type of noise 
with the plurality of S/N ratios are respectively superposed are 
created, the noise is eliminated by a predetermined noise elimination 
method from each of the speech data, and acoustic models corresponding 
to each of the S/N ratios are created using the feature vectors of each 
of the speech data which have undergone the noise elimination. When a 
speech recognition is actually performed, the S/N ratio of a noise 
superposed on speech data to be recognized is determined, a 
corresponding acoustic model is selected from the acoustic models 
corresponding to each of the S/N ratios, the noise is eliminated by the 
predetermined noise elimination method from the speech data on which the 
noise is superposed, and a speech recognition is performed on the 
feature vector of the speech data which has undergone the noise 
elimination based on the selected acoustic model. 

This is advantageous in performing a speech recognition under an
environment where the type of noise can be identified but the magnitude 
(S/N ratio) tends to vary, achieving a high recognition rate under such 
an environment. 

[Description of the Embodiments] 

Embodiments of the present invention will be described below. The 
description of the embodiments includes a speech recognition method and 
speech recognition apparatus according to the present invention as well 
as specific processes of a speech recognition program stored on a 
storage medium according to the present invention.

Basically, the present invention eliminates noise superposed on a 
speech to be processed and performs a speech recognition on the speech 
data from which the noise has been eliminated. With regard to acoustic 
models used in the speech recognition, several types of noise 
(stationary noise) are assumed, each of the noises is superposed on 
speech data corresponding to a speech (clean speech data that does not 
contain any noise) to generate speech data with the noise superposed 
thereon, the noise is eliminated from the speech data on which the noise 
has been superposed, and the acoustic models are created using the 
speech waveform after the noise elimination process (which somewhat 
varies from the clean speech data that does not contain any noise).

That is, the acoustic models from which noise has been 
substantially eliminated are created using the above procedure for each 
of the predefined noise types. 

When a speech recognition is actually performed, the type of a 
noise superposed on speech data to be recognized is determined, the 
noise is eliminated, an acoustic model is selected according to the type 
of the noise, and the speech recognition is performed based on the 
selected acoustic model. 

Furthermore, the acoustic models are created for different values 
of S/N ratio which represents the ratio of the magnitudes of speech data 
and noise, as well as for each of the noise types. For example, if 
three types of noise N1, N2, and N3 are selected, three acoustic models
are created taking only the noise types into consideration. If two
different S/N ratios are assumed for each of the noises, the acoustic
models are created in the above procedure with two different magnitudes 
for each of the noises, thus resulting in six acoustic models. 

For example, if two S/N ratio levels, namely, an S/N ratio smaller
than a certain value L1 (S/N < L1) and an S/N ratio greater than or
equal to L1 (S/N ≥ L1) are considered, an acoustic model for an S/N
ratio smaller than L1 and an acoustic model for an S/N ratio greater
than or equal to L1 are created for the noise N1. Similarly, for each
of the noise N2 and the noise N3, two acoustic models, i.e., an acoustic
model for an S/N ratio smaller than L1 and an acoustic model for an S/N
ratio greater than or equal to L1, are created. Thus, in total, six
acoustic models are created.

Techniques for the noise elimination described above include the
spectral subtraction (hereinafter referred to as SS) method and the
continuous spectral subtraction (hereinafter referred to as CSS) method.
These methods are believed to be particularly effective in eliminating
noise whose source is hard to locate (referred to as additive noise
as described earlier), such as the sound of an automobile, the sound of
an air conditioner, and the bustle of the city.

In addition to the SS method and the CSS method, another noise 
elimination method is the cepstrum mean normalization (hereinafter 
referred to as CMN) method. This method is believed to be effective in 
eliminating noise such as distortions due to microphone characteristics 
and spatial transmission characteristics including echo (referred to as 
multiplicative noise as described earlier).

The present invention will be described in relation to a first
embodiment in which the SS method or the CSS method is employed for
noise elimination, a second embodiment in which the CMN method is
employed, and a third embodiment in which both of them are employed.

[First Embodiment]

Fig. 1 is a diagram showing the schematic construction of a speech 
recognition apparatus according to the first embodiment of the present 
invention, including the following components: a microphone 1; an input
speech processing unit 2 including an amp and an A/D converter; a first
speech feature analysis unit 3; a noise segment/speech segment
determination unit 4; a feature data storage unit 5; a noise
type determination/acoustic model selection unit 6; an acoustic model 
storage unit 7; a noise elimination unit 8; a second speech feature 
analysis unit 9; a speech recognition unit 10; and a language model 
storage unit 11. The functions and operations of each of the components 
will be described below with reference to a flowchart shown in Fig. 2. 

Referring to Fig. 2, the first speech feature analysis unit 3 
analyzes the speech feature of speech data to be recognized which has 
undergone an A/D conversion, on a frame-by-frame basis (the duration of 
each frame is, for example, on the order of 20 to 30 msec) (step s1).
The speech feature analysis is performed in the frequency domain, for
example, by FFT (Fast Fourier Transform).
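
As an illustration of the frame-by-frame analysis in step s1, the
following Python sketch splits a sampled signal into frames and computes
a power spectrum for each frame. The function name, the non-overlapping
framing, the 25 msec default, and the direct DFT (used in place of an
FFT to keep the sketch dependency-free) are assumptions made for
illustration and are not taken from the embodiment.

```python
import cmath


def frame_power_spectra(samples, sample_rate, frame_ms=25):
    """Split samples into non-overlapping frames and return one power
    spectrum (list of floats) per frame.

    A direct DFT is used here for brevity; a real device would use an FFT.
    """
    size = int(sample_rate * frame_ms / 1000)  # samples per frame
    spectra = []
    for start in range(0, len(samples) - size + 1, size):
        frame = samples[start:start + size]
        spectrum = []
        # Only the non-redundant half of the spectrum is kept.
        for k in range(size // 2 + 1):
            acc = sum(
                x * cmath.exp(-2j * cmath.pi * k * i / size)
                for i, x in enumerate(frame)
            )
            spectrum.append(abs(acc) ** 2 / size)
        spectra.append(spectrum)
    return spectra
```

Each returned spectrum carries both the power and the frequency
characteristics of one frame, which is the kind of feature data the
later steps operate on.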

The noise segment/speech segment determination unit 4 determines
whether the speech data is a noise segment or a speech segment based on
the power, frequency characteristics, etc. obtained by the speech
feature analysis (step s2). If the speech data is determined as a noise
segment, the feature data of the most recent n frames is stored in the
feature data storage unit 5 (step s3). The processes of steps s1 to s3
are repeated until a speech segment is detected, at which point the
noise type determination/acoustic model selection unit 6 starts
determination of the noise type and selection of an acoustic model. The
noise type determination and the acoustic model selection will be
described below.
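
The loop over steps s1 to s3 can be sketched as follows. This is a
minimal sketch, assuming that the feature data storage unit 5 behaves
like a fixed-length buffer holding only the most recent n noise frames;
the function name and the caller-supplied is_speech predicate are
hypothetical.

```python
from collections import deque


def buffer_noise_until_speech(frames, is_speech, n):
    """Keep the most recent n noise frames until a speech frame appears.

    Returns (stored_noise_frames, index_of_first_speech_frame); the index
    is None if no speech frame is found.
    """
    recent = deque(maxlen=n)  # older frames are discarded automatically
    for index, frame in enumerate(frames):
        if is_speech(frame):       # step s2: speech segment detected
            return list(recent), index
        recent.append(frame)       # step s3: store noise feature data
    return list(recent), None
```

A bounded buffer keeps memory use constant however long the noise
segment lasts, which suits the low-cost hardware the specification has
in mind.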

First, it is determined whether the start of a noise type
determination and acoustic model selection has been requested (step s4),
and if a request has been made, the type and magnitude (S/N ratio) of
the noise are determined and an acoustic model is selected based on the
result (step s5).

The type and magnitude of the noise are determined using the feature
data of the most recent n frames of the noise segment stored in the
feature data storage unit 5 and the feature data of each of the several
frames of the speech segment obtained by the first speech feature
analysis. These feature data represent power as well as frequency
characteristics, so that the power of the speech is recognized as well
as the type and power of the noise.

For example, in the first embodiment, stationary noise such as the
sound of an automobile, the sound of an air conditioner, and the bustle
of the city is assumed. Three types of such stationary noise will be
considered herein, respectively designated as noise N1, noise N2, and
noise N3. Examination of the feature data of the n frames of the noise
segment allows determination as to whether the noise segment is most
similar to the noise N1, the noise N2, or the noise N3.
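
The specification does not state how similarity to N1, N2, or N3 is
measured. One possible approach, shown here purely as an assumption, is
to average the spectra of the stored noise frames and pick the reference
spectrum with the smallest squared Euclidean distance. The function
name and the reference spectra are hypothetical.

```python
def classify_noise(noise_frames, templates):
    """Return the name of the template closest to the mean noise spectrum.

    noise_frames: list of per-frame spectra (lists of floats).
    templates: dict mapping a noise name to a reference spectrum.
    """
    n = len(noise_frames)
    dim = len(noise_frames[0])
    mean = [sum(frame[k] for frame in noise_frames) / n for k in range(dim)]

    def squared_distance(reference):
        return sum((m - r) ** 2 for m, r in zip(mean, reference))

    return min(templates, key=lambda name: squared_distance(templates[name]))
```

Averaging over the n frames is reasonable only because the noise is
assumed to be stationary.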

Furthermore, the S/N ratio can be obtained from the power of the 
noise and the power of the speech. Because the S/N ratio must be 
calculated when the power of the speech segment has a magnitude of a 
certain degree, the S/N ratio is calculated using the maximum value or 
the mean value of several frames or all the frames in the speech 
segment.
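
The S/N calculation described above can be sketched as follows. The
function name, the use of the mean noise power, and the expression of
the ratio in decibels are assumptions; the specification states only
that the ratio is obtained from the noise power and the maximum or mean
speech power.

```python
import math


def estimate_snr_db(noise_powers, speech_powers, use_max=False):
    """Estimate the S/N ratio in dB from per-frame power values.

    noise_powers: powers of the stored noise frames.
    speech_powers: powers of several (or all) speech frames.
    use_max: use the maximum speech power instead of the mean.
    """
    noise = sum(noise_powers) / len(noise_powers)
    if use_max:
        speech = max(speech_powers)
    else:
        speech = sum(speech_powers) / len(speech_powers)
    return 10.0 * math.log10(speech / noise)
```

The result can then be compared with the threshold L1, provided L1 is
expressed in the same unit.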

When the noise type has been determined and the S/N ratio obtained in
the manner described above, an acoustic model is then selected.
In the first embodiment, acoustic models are created assuming the
three types of stationary noise N1, N2, and N3: an acoustic model for an
S/N ratio smaller than L1 and an acoustic model for an S/N ratio greater
than or equal to L1 for each of the noise types N1, N2, and N3.

For example, in the first embodiment, the noise N1 with an S/N
ratio smaller than L1 is associated with an acoustic model M1, the noise
N1 with an S/N ratio greater than or equal to L1 is associated with an
acoustic model M2, the noise N2 with an S/N ratio smaller than L1 is
associated with an acoustic model M3, the noise N2 with an S/N ratio
greater than or equal to L1 is associated with an acoustic model M4, the
noise N3 with an S/N ratio smaller than L1 is associated with an
acoustic model M5, and the noise N3 with an S/N ratio greater than or
equal to L1 is associated with an acoustic model M6. The six acoustic
models M1, M2, ..., and M6 are stored in the acoustic model storage unit
7. The acoustic models M1, M2, ..., and M6 are created as follows.

Six patterns of noise, i.e., two different S/N ratios (smaller than 
L1, and greater than or equal to L1) for each of the noises N1, N2, and
N3, are prepared, and the six patterns of noise are superposed on speech 
data that does not contain any noise, whereby six patterns of speech 
data are created. 

The six patterns of speech data are: speech data on which the noise
N1 with an S/N ratio smaller than L1 is superposed; speech data on which
the noise N1 with an S/N ratio greater than or equal to L1 is
superposed; speech data on which the noise N2 with an S/N ratio smaller
than L1 is superposed; speech data on which the noise N2 with an S/N
ratio greater than or equal to L1 is superposed; speech data on which
the noise N3 with an S/N ratio smaller than L1 is superposed; and speech
data on which the noise N3 with an S/N ratio greater than or equal to L1
is superposed.
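
Superposing a noise on clean speech data at a chosen S/N ratio can be
sketched as below. The function name, the decibel convention, and the
choice of a single representative S/N value per level are assumptions;
the specification defines the two levels only relative to L1.

```python
import math


def superpose_noise(clean, noise, target_snr_db):
    """Scale the noise so that clean/noise power equals target_snr_db,
    then add it to the clean samples.

    The noise is repeated cyclically if it is shorter than the speech.
    """
    p_clean = sum(x * x for x in clean) / len(clean)
    p_noise = sum(x * x for x in noise) / len(noise)
    # Gain that brings the noise power to p_clean / 10^(snr/10).
    gain = math.sqrt(p_clean / (p_noise * 10 ** (target_snr_db / 10)))
    return [c + gain * noise[i % len(noise)] for i, c in enumerate(clean)]
```

Calling this once per noise type and per S/N level would yield the six
patterns of speech data listed above.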

Noise is eliminated from each of the six patterns of speech data by
a predetermined noise elimination method, and the six acoustic models
M1, M2, ..., and M6 are created using the feature vectors obtained by
analyzing the six patterns of speech data which have undergone the noise
elimination.

If it is determined in step s5, for example, that the type of the
noise is most similar to the noise N1 and the S/N ratio obtained is
smaller than L1 (S/N < L1), the acoustic model M1 is selected from the
acoustic model storage unit 7.
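
The association between noise type, S/N level, and acoustic model
amounts to a table lookup. The following sketch assumes string labels
for the noise types and models; the table and function names are
hypothetical.

```python
# (noise type, S/N level) -> acoustic model, as associated in the text.
MODEL_TABLE = {
    ("N1", "low"): "M1", ("N1", "high"): "M2",
    ("N2", "low"): "M3", ("N2", "high"): "M4",
    ("N3", "low"): "M5", ("N3", "high"): "M6",
}


def select_acoustic_model(noise_type, snr, l1):
    """Select a model name from the noise type and the S/N ratio.

    "low" means S/N < L1; "high" means S/N >= L1.
    """
    level = "low" if snr < l1 else "high"
    return MODEL_TABLE[(noise_type, level)]
```

With more S/N levels or noise types, only the table and the level
computation would need to grow.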

When an acoustic model has been selected in accordance with the noise
type and the S/N ratio, the noise is then eliminated by the noise
elimination unit 8 (step s6). The noise elimination employs the SS
method or the CSS method in the first embodiment, and performs a
spectral subtraction using the feature data of the most recent n frames 
of the noise segment stored in the feature data storage unit 5 and the 
feature data of the speech segment. Thus, speech data from which the 
noise has been substantially eliminated is obtained. Even after the 
noise elimination, however, the speech data includes a slight residual 
of the noise. 
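
A basic form of spectral subtraction is sketched below; it does not
cover the CSS variant. It assumes that the feature data are per-frame
power spectra and that the noise estimate is the mean of the stored
noise frames. The floor parameter, which prevents negative power
values, is a common practical safeguard and is not described in the
specification.

```python
def spectral_subtract(speech_frames, noise_frames, floor=0.01):
    """Subtract the mean noise spectrum from each speech frame.

    speech_frames, noise_frames: lists of power spectra (lists of floats).
    floor: fraction of the original bin power kept as a lower bound.
    """
    n = len(noise_frames)
    bins = len(noise_frames[0])
    noise_est = [sum(f[k] for f in noise_frames) / n for k in range(bins)]
    cleaned = []
    for frame in speech_frames:
        cleaned.append(
            [max(s - nz, floor * s) for s, nz in zip(frame, noise_est)]
        )
    return cleaned
```

The floored bins are one source of the slight noise residual mentioned
above, which is why the acoustic models are trained on data processed
in the same way.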

Then, the second speech feature analysis unit 9 analyzes the 
feature of the speech data which has undergone the noise elimination
(step s7). The feature analysis by the second speech feature analysis 
unit 9 will be referred to as the second feature analysis herein. 

The second speech feature analysis obtains cepstrum coefficients 
which will be used when the speech recognition unit 10 performs a speech 
recognition. Because the speech analysis in step s1 employs a
frequency-domain analysis method such as FFT, the result thereof being
speech feature data in the frequency domain, the second speech feature
analysis obtains mel frequency cepstrum coefficients as the cepstrum
coefficients.

The mel cepstrum coefficients obtained by the second speech feature 
analysis are supplied to the speech recognition unit 10, and the speech 
recognition unit 10 performs a speech recognition on the mel cepstrum 
coefficients. The speech recognition is performed based on the acoustic 
model selected in step s5 (the acoustic model M1 in the example
described earlier) and a language model stored in the language model 
storage unit 11. 


When the second speech feature analysis in step s7 is completed, it
is determined whether the speech segment has terminated (step s8). If
the speech segment has completely terminated, the processing is exited,
and if the speech segment has not terminated, the processing returns to
step s1 and the same processes are repeated.

That is, the first speech feature analysis is performed (step s1),
it is determined whether the speech data is a noise segment or a speech
segment (step s2), and if the speech data is determined as a speech
segment, the processing proceeds to step s4 and the subsequent
processes. If no request has been made for the selection of an acoustic
model, it is determined whether the determination of the noise type and
magnitude (S/N ratio) and the selection of an acoustic model based on
the result thereof have been completed (step s9), and if the process has
been completed, the noise elimination is performed (step s6), whereas if
the process has not been completed, the feature data of the speech
segment obtained by the first speech feature analysis is stored (step
s10).

The series of processes are repeated until the speech segment 
terminates. As described above, an acoustic model is selected in 
accordance with the type and S/N ratio of a noise superposed on speech 
data to be recognized, and a speech recognition is performed based on 
the selected acoustic model and a predefined language model. 

As described earlier, the six acoustic models M1, M2, ..., and M6
in the first embodiment have been created by superposing the three types
of noise N1, N2, and N3, with two S/N ratios for each, on speech data
(clean speech data that does not contain any noise) to generate the six
patterns of speech data and eliminating the noise from each of the six
patterns of speech data (by the SS method or the CSS method), and using
the six patterns of speech data which have undergone the noise
elimination (including a slight residual of the noise as opposed to the
clean speech data that does not contain any noise). Thus, the six
acoustic models have been created based on speech data similar to actual 
speech data to be recognized. 

Thus, the most suitable acoustic model is selected for the actual
speech data to be recognized in accordance with the type and S/N ratio
of the noise superposed on the speech data, and a speech recognition is 
performed based on the selected acoustic model, thereby enhancing the 
accuracy of the recognition. 

Furthermore, the first embodiment employs the SS method or the CSS 
method for noise elimination, reducing the amount of operations required 
for the noise elimination to such an extent feasible for a CPU with a 
relatively low operation capability. 

This allows implementation in small-scale inexpensive hardware. 
Furthermore, because the SS method and the CSS method are believed to be 
effective for elimination of noise such as the sound of an automobile, 
the sound of an air conditioner, and the bustle of the city, the first 
embodiment is highly advantageous when applied to devices typically used 
in environments with a considerable amount of such noise. 

[Second Embodiment]

The second embodiment employs the cepstrum mean normalization (CMN) 
method for noise elimination. Fig. 3 shows the schematic construction
of a speech recognition apparatus according to the second embodiment of 
the present invention, including the following components: a microphone 
1; an input speech processing unit 2 including an amp and an A/D 
converter; a speech feature analysis unit 21; a noise segment/speech 
segment determination unit 4; a feature data storage unit 5; a noise 
type determination/acoustic model selection unit 6; an acoustic model
storage unit 7; a noise elimination unit 8; a speech recognition unit 
10; and a language model storage unit 11. The functions and operations 
of each of the components will be described below with reference to a 
flowchart shown in Fig. 4. 

Referring to Fig. 4, the speech feature analysis unit 21 analyzes 
the feature of speech data to be processed which has undergone an A/D 
conversion, on a frame-by-frame basis (the duration of each frame is, 
for example, on the order of 20 to 30 msec) (step s21). The speech
feature analysis in the second embodiment obtains cepstrum coefficients
(e.g., mel frequency cepstrum coefficients or LPC cepstrum
coefficients).

The noise segment/speech segment determination unit 4 determines
whether the speech data is a noise segment or a speech segment based on
the result of the speech feature analysis (step s22). If it is
determined that the speech data is a noise segment, the noise
segment/speech segment determination unit 4 further determines whether
the noise segment exists at the beginning or the end of the speech
segment along the time axis (step s23).

Based on the result of the determination, if the noise segment
exists at the beginning of the speech segment along the time axis, the
feature data (the feature vector of cepstrum coefficients) of the most
recent n1 frames obtained by the feature analysis is stored in the
feature data storage unit 5 (step s24). If the speech data is
determined as a speech segment, the feature data (the feature vector of
cepstrum coefficients) of n2 frames of the speech segment (from the
beginning to the end of the speech segment) is stored in the feature
data storage unit 5 (step s25).

The speech feature analysis is repeated until a noise segment is
detected and the noise segment is determined as existing at the end of
the speech segment along the time axis (steps s21, s22, and s23), at
which point it is determined that the speech segment has terminated, and
the feature data (the feature vector of cepstrum coefficients) of n3
frames after the end of the speech segment is stored in the feature data
storage unit 5 (step s26).

Then, it is determined whether the storage of the feature data of
the n3 frames has been completed (step s27), and if the process has been
completed, the noise type determination/acoustic model selection unit 6
starts the determination of the noise type and selection of an acoustic
model (step s28). The noise type determination and the acoustic model
selection will be described below. 

The determination of the noise type and the S/N ratio and the 
selection of an acoustic model are performed using the feature data of 
the n1 frames and the n2 frames which have been stored in the feature
data storage unit 5.

More specifically, which of the noise types is most similar to the 
noise can be determined using the feature data of the noise segment 
(e.g., the feature data of the n1 frames), and the S/N ratio can be
determined from the power of the noise obtained by analyzing the feature 
of the noise segment and the power of the speech segment. 

In the second embodiment as well, the processing is based on the 
three noise types N1, N2, and N3.

Based on the noise type determined and the S/N ratio obtained, one
of the acoustic models is selected. Similarly to the first embodiment
described earlier, for example, if the noise type is determined as most
similar to the noise N1 and the S/N ratio is smaller than L1, an
acoustic model M1 is selected.

In the second embodiment, similarly to the first embodiment, six
acoustic models M1, M2, ..., and M6 in accordance with the noise type
and the S/N ratio are prepared.

More specifically, in the second embodiment, similarly to the first
embodiment, the noise N1 with an S/N ratio smaller than L1 is associated
with an acoustic model M1, the noise N1 with an S/N ratio greater than
or equal to L1 is associated with an acoustic model M2, the noise N2
with an S/N ratio smaller than L1 is associated with an acoustic model
M3, the noise N2 with an S/N ratio greater than or equal to L1 is
associated with an acoustic model M4, the noise N3 with an S/N ratio
smaller than L1 is associated with an acoustic model M5, and the noise
N3 with an S/N ratio greater than or equal to L1 is associated with an
acoustic model M6. The six acoustic models M1, M2, ..., and M6 are
stored in the acoustic model storage unit 7.

Because the noise elimination employs the CMN (cepstrum mean
normalization) method in the second embodiment, the acoustic models M1,
M2, ..., and M6 are created by the CMN method. More specifically, the
acoustic models M1, M2, ..., and M6 are created as follows.

Six patterns of noise with two different S/N ratios (smaller than 
L1, and greater than or equal to L1) for each of the noises N1, N2, and
N3 are prepared, and the six patterns of noise are superposed on speech
data that does not contain any noise, whereby six patterns of speech
data are created. 

The six patterns of speech data are: speech data on which the noise
N1 with an S/N ratio smaller than L1 is superposed; speech data on which
the noise N1 with an S/N ratio greater than or equal to L1 is
superposed; speech data on which the noise N2 with an S/N ratio smaller
than L1 is superposed; speech data on which the noise N2 with an S/N
ratio greater than or equal to L1 is superposed; speech data on which
the noise N3 with an S/N ratio smaller than L1 is superposed; and speech
data on which the noise N3 with an S/N ratio greater than or equal to L1
is superposed.

Noise is eliminated from each of the six patterns of speech data by 
the CMN method, and the six acoustic models M1, M2, ..., and M6 are
created using the feature vectors of the six patterns of speech data 
from which the noise has been eliminated. 

If it is determined in step s28, for example, that the noise is
most similar to the noise N1, and the S/N ratio obtained is smaller than
L1, the acoustic model M1 is selected from the acoustic model storage
unit 7.

Although the type and magnitude (S/N ratio) of the noise may be
determined only from the feature data of the n1 frames (the feature data
of the noise at the beginning of the speech segment) and the feature
data of the n2 frames (the feature data of the speech segment from the
beginning to the end thereof), the feature data of the n3 frames (the
feature data of the noise at the end of the speech segment) may be used
in addition.

Then, the noise elimination unit 8 eliminates the noise by the CMN 
method. In the noise elimination by the CMN method, first, the mean 
feature vector of the n2 frames is obtained using the feature vector 
obtained by the speech feature analysis of the speech segment (the 
feature vector of the n2 frames) (step s29).

The mean feature vector may be obtained using all the feature 
vectors of the n1, n2, and n3 frames instead of only the feature vector of
the n2 frames. It will be assumed herein that the mean feature vector 
is obtained using only the feature vector of the n2 frames of the speech 
segment from the beginning to the end thereof. 

If, for example, n2 = 20, the mean of the feature vectors of the 20 
frames (designated as C1, C2, ..., and C20, each having, for example, 
10th order components) is obtained. The mean feature vector obtained is 
designated as Cm. 

Then, using the mean feature vector obtained, the feature vectors 
of the speech segment (20 frames in this example) are recalculated (step 
s30). The recalculation subtracts the mean feature vector Cm from each 
of the feature vectors C1, C2, ..., and C20 of the 20 frames of the 
speech segment, i.e., in this example, C1' = C1 - Cm, C2' = C2 - Cm, 
..., C20' = C20 - Cm. C1', C2', ..., and C20' which have been obtained 
are the feature vectors of the 20 frames after the noise elimination. 
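
The mean subtraction of steps s29 and s30 can be sketched as follows. This is a minimal illustration in plain Python, assuming each frame is a list of cepstral coefficients of equal length; the function name is hypothetical.

```python
def cepstral_mean_normalize(frames):
    """Subtract the mean feature vector Cm from every frame.

    `frames` is a non-empty list of equal-length feature vectors
    (C1, C2, ..., Cn). Returns (normalized_frames, mean_vector), where
    normalized_frames[i] corresponds to Ci' = Ci - Cm.
    """
    n = len(frames)
    dim = len(frames[0])
    # Cm: per-component mean over all frames of the speech segment.
    mean = [sum(frame[d] for frame in frames) / n for d in range(dim)]
    normalized = [[frame[d] - mean[d] for d in range(dim)] for frame in frames]
    return normalized, mean
```

With two frames `[1.0, 4.0]` and `[3.0, 8.0]`, the mean vector is `[2.0, 6.0]` and the normalized frames are `[-1.0, -2.0]` and `[1.0, 2.0]`.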

The feature vectors C1' to C20' are supplied to the speech 
recognition unit 10, and the speech recognition unit 10 performs a 
speech recognition based on the selected acoustic model and a predefined 
language model. 

As described above, in the second embodiment, similarly to the 
first embodiment described earlier, an acoustic model is selected in 
accordance with the noise type and the S/N ratio, and a speech 
recognition is performed using the selected acoustic model and the 
language model stored in the language model storage unit 11. 

Similarly to the first embodiment, the six acoustic models in the 
second embodiment are created by superposing the three types of noise 
Nl, N2, and N3, with two different S/N ratios for each, on speech data 
(clean speech data that does not contain any noise) to generate the six 
patterns of speech data with the noise superposed thereon, eliminating 
the noise from each of the six patterns of speech data by the CMN 
method, and using the six patterns of speech data which have undergone 
the noise elimination (including slight residue of the noise as opposed 
to the clean speech data that does not contain any noise). That is, the 
six acoustic models are created based on speech data similar to actual 
speech data to be recognized. 

Thus, a most suitable acoustic model is selected in accordance with 
the type and the S/N ratio of the noise superposed on the actual speech 
data to be recognized, and a speech recognition is performed using the 
selected acoustic model, thereby enhancing the accuracy of recognition. 

Furthermore, the CMN method used for noise elimination in the 
second embodiment reduces the amount of operations associated with noise 
elimination to a level feasible for a CPU with a relatively low 
operation capability, allowing implementation in small-scale 
inexpensive hardware. Furthermore, because the CMN method is believed to 
be effective in eliminating noise due to microphone characteristics and 
spatial transmission characteristics including echo (multiplicative 
noise), it is highly advantageous when applied to devices typically used 
in environments where such noise is likely to be present. 

[Third Embodiment] 

The third embodiment combines the first embodiment and the second 
embodiment. In the third embodiment, similarly to the first and the 
second embodiments, six acoustic models Ml, M2, ••*, and M6 are prepared 
in accordance with the noise types and the S/N ratios. The acoustic 
models in the third embodiment are created as follows. 

As described earlier, the three types of noise N1, N2, and N3 with 
two S/N ratios for each are superposed on speech data (clean speech data 
that does not contain any noise) to generate six patterns of speech data 
with noise superposed thereon, and the noise is eliminated from each of 
the six patterns of speech data by the SS method or the CSS method to 
generate six patterns of speech data from which the noise has been 
substantially eliminated (including slight residue of the noise as 
opposed to the clean speech data that does not contain any noise). 

Then, the CMN method is performed on the six patterns of speech 
data which have undergone the noise elimination by the SS method or the 
CSS method. More specifically, as described earlier, the mean feature 
vector of the n2 frames is obtained using the feature vectors obtained 
by the feature analysis of the speech segment in each of the speech data 
(the feature vectors of the n2 frames). If, for example, n2 = 20, the 
mean feature vector Cm of the 20 frames (indicated by C1, C2, ..., and 
C20, each having, for example, 10th order components) is obtained. 

Then, using the mean feature vector obtained, the feature vectors 
of the speech segment (20 frames in this example) are recalculated, 
i.e., C1' = C1 - Cm, C2' = C2 - Cm, ..., C20' = C20 - Cm, obtaining the 
feature vectors C1', C2', ..., and C20' of each of the 20 frames (of the 
speech segment), and the acoustic models are created using the feature 
vectors of each of the frames. 

The process is performed with two different S/N ratios for each of 
the noises N1, N2, and N3, creating the six acoustic models M1, M2, ..., 
and M6. 
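
The first step of the model creation, superposing a noise on clean speech at a chosen S/N ratio, can be sketched as below. This is an illustration only: the text does not say how the noise level is set, so the sketch assumes the S/N ratio is the ratio of mean sample powers expressed in decibels, and the function names are hypothetical.

```python
import math


def mean_power(samples):
    """Mean squared amplitude of a sequence of samples."""
    return sum(s * s for s in samples) / len(samples)


def superpose_noise(clean, noise, snr_db):
    """Add `noise` to `clean`, scaled so the clean/noise power ratio is snr_db.

    Assumption: S/N ratio = 10 * log10(P_clean / P_noise) over the whole
    utterance. `noise` must be at least as long as `clean`.
    """
    if len(noise) < len(clean):
        raise ValueError("noise must be at least as long as the clean speech")
    noise = noise[:len(clean)]
    p_clean = mean_power(clean)
    p_noise = mean_power(noise)
    if p_noise == 0.0:
        raise ValueError("noise has zero power and cannot be scaled")
    # Gain that brings the noise power to p_clean / 10^(snr_db / 10).
    gain = math.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return [c + gain * n for c, n in zip(clean, noise)]
```

Calling this once per noise type and per S/N ratio would give the six noisy patterns, each of which is then passed through noise elimination before the acoustic model is trained.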

Fig. 5 is a diagram showing the schematic construction of a speech 
recognition apparatus according to the third embodiment, including the 
components as follows: a microphone 1; an input speech processing unit 2 
including an amp and an A/D converter; a first speech feature analysis 
unit 3; a noise segment/speech segment determination unit 4; a feature 
data storage unit 5; a noise type determination/acoustic model selection 
unit 6; an acoustic model storage unit 7; a noise elimination unit 8; a 
second speech feature analysis unit 9; a CMN operation unit (CMN noise 
elimination unit) 31; a speech recognition unit 10; and a language model 
storage unit 11. The components will be described below with reference 
to a flowchart shown in Fig. 6. 

Referring to Fig. 6, first, the first speech feature analysis unit 
3 analyzes the feature of speech data to be recognized which has 
undergone an A/D conversion, on a frame-by-frame basis (the duration of 
each frame is, for example, on the order of 20 to 30 msec) (step s41). 
The speech feature analysis is performed in the frequency domain, for 
example, by FFT (Fast Fourier Transform), as described earlier. 

Based on the result of the speech feature analysis, the noise 
segment/speech segment determination unit 4 determines whether the 
speech data is a noise segment or a speech segment (step s42). If the 
speech data is determined as a noise segment, the noise segment/speech 
segment determination unit 4 further determines whether the noise 
segment exists at the beginning or the end of the speech segment along 
the time axis (step s43). If the noise segment is determined as 
existing at the beginning of the speech segment along the time axis, the 
feature data of the most recent n1 frames is stored in the feature data 
storage unit 5 (step s44). 

If the speech data is determined as a speech segment, the noise 
elimination unit 8 eliminates noise by the SS or the CSS method (step 
s45). Then, the second speech feature analysis unit 9 analyzes the 
feature of the speech data which has undergone the noise elimination 
(step s46), and the speech feature data (feature vector) obtained 
thereby is stored (step s47). The second speech feature analysis 
obtains mel frequency cepstrum coefficients. 
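
The noise elimination of step s45 can be illustrated with a generic spectral subtraction sketch. It is not taken from this specification: it assumes per-frame power spectra, a noise estimate averaged over the stored noise frames, and two illustrative parameters (`alpha`, an over-subtraction factor, and `floor`, a minimum fraction of each bin that is retained) that do not appear in the text.

```python
def estimate_noise_spectrum(noise_frames):
    """Average the power spectra of the noise frames, bin by bin."""
    bins = len(noise_frames[0])
    count = len(noise_frames)
    return [sum(frame[b] for frame in noise_frames) / count for b in range(bins)]


def spectral_subtract(frame_power, noise_power, alpha=1.0, floor=0.01):
    """Subtract the estimated noise power from one frame's power spectrum.

    Each bin is clamped to floor * original value so that the result never
    becomes negative (hypothetical flooring rule, chosen for illustration).
    """
    return [
        max(p - alpha * n, floor * p)
        for p, n in zip(frame_power, noise_power)
    ]
```

In the flow above, the cleaned spectrum of each speech frame would then be passed to the second speech feature analysis of step s46.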

The processing returns to step s41, in which the first speech 
feature analysis is repeated, and based on the result of the speech 
feature analysis, it is determined whether the speech data is a noise 
segment or a speech segment. If it is determined that the speech data 
is a noise segment and the noise segment exists at the end of the speech 
segment along the time axis (steps s41, s42, and s43), it is determined 
that the speech segment has terminated, and the noise type determination 
and the acoustic model selection in step s48 start. 

The determination of the noise type and the magnitude (S/N ratio) 
and the selection of an acoustic model are performed using the speech 
feature data of the n1 frames and the n2 frames which have been stored. 
More specifically, which of the three noise types (N1, N2, and N3) 
described earlier is most similar to the feature data of the noise 
segment can be determined using the feature data of the noise segment 
(e.g., the feature data of the n1 frames), and the S/N ratio can be 
determined by the power of the noise segment obtained from the feature 
data of the noise segment and the power of the speech obtained from the 
feature data of the speech segment. 
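
The determination described above can be sketched as follows. The text does not specify the similarity measure, so this illustration assumes a squared Euclidean distance between the mean noise feature vector and one stored reference vector per noise type; the reference vectors, function names, and the decibel form of the S/N ratio are assumptions.

```python
import math


def nearest_noise_type(noise_frames, templates):
    """Return the name of the reference noise closest to the noise segment.

    `noise_frames`: feature vectors of the noise segment (e.g. the n1 frames).
    `templates`: dict mapping a noise name to a reference feature vector
    (hypothetical; how references are obtained is not covered here).
    """
    count = len(noise_frames)
    dim = len(noise_frames[0])
    mean = [sum(f[d] for f in noise_frames) / count for d in range(dim)]

    def squared_distance(reference):
        return sum((m - r) ** 2 for m, r in zip(mean, reference))

    return min(templates, key=lambda name: squared_distance(templates[name]))


def snr_db(speech_power, noise_power):
    """S/N ratio in decibels from the speech and noise segment powers."""
    return 10.0 * math.log10(speech_power / noise_power)
```

The two results, a noise name and an S/N ratio, are the inputs needed for the acoustic model selection in step s48.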

Based on the noise type and the S/N ratio, one of the acoustic 
models is selected. Similarly to the first and the second embodiments, 
for example, if the noise type is most similar to the noise N1 and the 
S/N ratio is smaller than L1, the acoustic model M1 is selected. 

When the acoustic model selection is complete, the CMN operation 
unit 31 generates speech feature data required for speech recognition 
(steps s49 and s50). The feature data is generated using the CMN noise 
elimination method described earlier. 

As described in relation to the second embodiment, the CMN method 
obtains the mean feature vector Cm of the n2 frames in the procedure 
described earlier using the feature vectors obtained by the feature 
analysis of the speech segment (the feature vectors of the n2 frames). 
Using the mean feature vector Cm, the feature vectors of the speech 
segment (20 frames in this example) are recalculated, i.e., C1' = C1 - 
Cm, C2' = C2 - Cm, ..., and C20' = C20 - Cm. 

C1', C2', ..., and C20' which have been obtained are the feature 
vectors of each of the 20 frames. The feature vectors C1', C2', ..., and 
C20' of each of the frames are supplied to the speech recognition unit 
10, and the speech recognition unit 10 performs a speech recognition 
using the selected acoustic model and a language model stored in the 
language model storage unit 11. 

As described above, in the third embodiment, similarly to the first 
and the second embodiments, an acoustic model is selected in accordance 
with the noise type and the S/N ratio, and a speech recognition is 
performed using the selected acoustic model and a predefined language 
model. 

In the third embodiment, acoustic models are created using both the 
SS method (or the CSS method) and the CMN method. When a speech 
recognition is actually performed, noise is eliminated by the SS method 
(or the CSS method), a feature vector is generated by the CMN method 
from the speech data which has undergone the noise elimination, and the 
feature vector is supplied to the speech recognition unit 10 for speech 
recognition, thereby enhancing the accuracy of recognition. 
Furthermore, the third embodiment can be suitably applied to a wide 
range of noise including additive noise and multiplicative noise. 

The present invention is not limited to the embodiments described 
hereinabove, and various modifications can be made without departing 
from the gist of the present invention. For example, although an 
example of three noise types N1, N2, and N3 with two different S/N 
ratios for each is given in the embodiments, the present invention is 
not limited thereto. 

Furthermore, with respect to the noise types, instead of viewing 
the sound of an automobile, the sound of an air conditioner, and the 
bustle of the city as individual noises, a combination of several noises 
may be considered as a single noise. 

As an example, acoustic models for speech recognition may be 
created by superposing both the sound of an automobile and the sound of 
an air conditioner on speech data taken in a noise-free environment, 
eliminating the noise from the speech data by a predetermined noise 
elimination method, and learning the speech data which has undergone the 
noise elimination. 

As described above, plural types of acoustic models can be created 
as desired for combinations of stationary noise likely to be present in 
environments where devices are used. Thus, by preparing several 
acoustic models most suitable for individual devices, a high recognition 
rate can be achieved. Furthermore, even better results can be obtained 
by preparing acoustic models for different S/N ratios. 

Furthermore, the constructions of the speech recognition 
apparatuses shown in Figs. 1, 3, and 5 are examples of implementation, 
and the constructions need not be exactly as shown in the figures. For 
example, although means for determining the noise type and means for 
selecting an acoustic model are implemented in a single unit as the 
noise type determination/acoustic model selection unit 6, it is to be 
understood that the noise type determination means and the acoustic 
model selection means may be provided as separate components. 

Furthermore, although the embodiments have been described in 
relation to an example of a plurality of (three) noise types and a 
plurality of (two) S/N ratios for each of the noise types, the present 
invention may be applied to speech recognition involving a particular 
noise (one noise type) with a plurality of S/N ratios. 

In that case, speech data on which the particular type of noise 
with different S/N ratios have been superposed respectively are 
generated, the noise is eliminated from each of the speech data by a 
predetermined noise elimination method, and acoustic models 
corresponding to each of the S/N ratios are created using the feature 
vectors of the speech data which have undergone the noise elimination. 

When a speech recognition is actually performed, the S/N ratio of 
the noise superposed on speech data to be recognized is determined, 
an acoustic model is selected from the acoustic models corresponding to 
each of the S/N ratios, the noise is eliminated by the predetermined 
noise elimination method from the speech data to be recognized on which 
the noise has been superposed, and a speech recognition is performed on 
the feature vector of the speech data which has undergone the noise 
elimination based on the selected acoustic model. 
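
For the single-noise case, the selection reduces to finding the S/N band that contains the measured value. A minimal sketch, assuming ascending thresholds in decibels and one model per band (both the thresholds and the model names are hypothetical):

```python
from bisect import bisect_right


def select_model_by_snr(snr_db, thresholds_db, models):
    """Pick the model whose S/N band contains snr_db.

    `thresholds_db` must be sorted in ascending order, and `models` must
    hold one more entry than `thresholds_db`. A value equal to a threshold
    falls into the upper band, matching the "smaller than" / "greater than
    or equal to" convention used for L1 in the embodiments.
    """
    if len(models) != len(thresholds_db) + 1:
        raise ValueError("need exactly one model per S/N band")
    return models[bisect_right(thresholds_db, snr_db)]
```

With thresholds `[10.0, 20.0]` and models `["M1", "M2", "M3"]`, an S/N ratio of 5 dB selects `"M1"` and an S/N ratio of exactly 10 dB selects `"M2"`.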

Although not shown, a speech recognition apparatus in that case 
includes: acoustic models corresponding to different S/N ratios created 
by generating speech data on which different types of noise with 
different S/N ratios for each have been superposed respectively, 
eliminating the noise by a predetermined noise elimination method from 
each of the speech data, and using the feature vectors of each of the 
speech data which have undergone the noise elimination; acoustic model 
storage means for storing the acoustic models; S/N ratio determination 
means for determining the S/N ratio of a noise superposed on speech data 
to be recognized; acoustic model selection means for selecting an 
acoustic model from the acoustic models corresponding to different S/N 
ratios based on the result of the determination; noise elimination means 
for eliminating the noise by the predetermined noise elimination method 
from the speech data to be recognized on which the noise has been 
superposed; and speech recognition means for performing a speech 
recognition on the feature vector of the speech data which has undergone 
the noise elimination based on the selected acoustic model. 

In this case as well, the noise elimination may employ the SS 
method (or the CSS method) and the CMN method, and by the processing 
described in relation to the first, the second, and the third 
embodiments, the S/N ratio of the noise superposed on speech data to be 
recognized is determined, an acoustic model is selected in accordance 
with the S/N ratio, and a speech recognition is performed based on the 
selected acoustic model. 

This is advantageous in performing a speech recognition in 
environments where the noise type can be identified but the magnitude 
(S/N ratio) thereof tends to vary, achieving a high recognition rate in 
such environments. In this case, the noise type has been identified and 
it is not necessary to determine the noise type, reducing the overall 
amount of operations to a level feasible for a CPU with a 
relatively low operation capability. 

Furthermore, although the embodiments have been described in 
relation to examples in which the SS method (or the CSS method) and the 
CMN method are employed for noise elimination, instead of the SS method 
(or the CSS method) or the CMN method proper, modifications thereof 
(e.g., the CMN method may be performed by distinguishing non-speech 
segments and speech segments) may be employed. 

Furthermore, for example, Δ cepstrum coefficients or Δ power may be 
used as the speech feature vectors. 
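
As an illustration of such difference-based features, delta coefficients can be computed as a frame-to-frame difference of the cepstrum (or power) values. The specification does not give a formula, so the sketch below assumes a simple central difference with the edge frames repeated; the function name is hypothetical.

```python
def delta_features(frames):
    """Central-difference delta: d[t] = (c[t+1] - c[t-1]) / 2.

    `frames` is a non-empty list of equal-length feature vectors. At the
    first and last frame the missing neighbour is replaced by the frame
    itself (an assumed edge rule).
    """
    n = len(frames)
    deltas = []
    for t in range(n):
        prev_frame = frames[max(t - 1, 0)]
        next_frame = frames[min(t + 1, n - 1)]
        deltas.append([(b - a) / 2.0 for a, b in zip(prev_frame, next_frame)])
    return deltas
```

For the one-dimensional sequence 0, 2, 6 this gives deltas of 1, 3, and 2.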

Furthermore, the present invention includes a storage medium such 
as a floppy disk, an optical disk, or a hard disk, on which a program 
defining the processes for implementing the present invention described 
above is stored. Alternatively, the program may be obtained via a 
network. 

[Advantages] 

As described above, according to the present invention, speech data 
on which different types of noise have been superposed respectively are 
created, and the noise is eliminated from each of the speech data on 
which the noise has been superposed, and acoustic models corresponding 
to each of the noise types are created using the speech data which have 
undergone the noise elimination. When a speech recognition is actually 
performed, the type of a noise superposed on speech data to be 
recognized is determined, and an acoustic model is selected from the 
acoustic models corresponding to the noise types based on the result of 
the determination, the noise is eliminated from the speech data to be 
recognized on which the noise has been superposed by the predetermined 
noise elimination method, and a speech recognition is performed on the 
speech data which has undergone the noise elimination based on the 
selected acoustic model. 

Accordingly, the speech recognition is performed based on a most 
suitable acoustic model in accordance with the type of the noise 
superposed, achieving a high recognition rate even in a noisy 
environment. 

In particular, if a device is used in an environment where two or 
three types of stationary noise are present, a high recognition rate can 
be achieved by creating acoustic models for each of the noise types and 
performing a speech recognition as described above based on the acoustic 
models. 

One of the noise elimination methods which may be employed according 
to the present invention is the spectral subtraction method or the 
continuous spectral subtraction method, in which case the noise 
elimination in the acoustic model creation process is performed by the 
spectral subtraction method or the continuous spectral subtraction 
method. When a speech recognition is actually performed, the type of a 
noise superposed is determined using the feature analysis data of the 
noise segment, a most suitable acoustic model is selected based on the 
result of the determination, the noise is eliminated by the spectral 
subtraction method or the continuous spectral subtraction method from 
the speech data to be recognized on which the noise has been superposed, 
and a speech recognition is performed on the result of a feature 
analysis of the speech data which has undergone the noise elimination 
based on the selected acoustic model. 

By employing the spectral subtraction method or the continuous 
spectral subtraction method as described above, noise elimination can be 
executed with a relatively small amount of operations feasible for a CPU 
with a relatively low operation capability. Accordingly, the method can 
be implemented in small-scale inexpensive hardware. Furthermore, since the 
spectral subtraction method and the continuous spectral subtraction 
method are believed to be effective in eliminating noise such as the 
sound of an automobile, the sound of an air conditioner, and the bustle 
of the city (generally referred to as additive noise), they are highly 
advantageous when applied to devices typically used in an environment 
with a considerable amount of such noise. 

As another example of noise elimination method, the cepstrum mean 
normalization method may be employed. In that case, the noise 
elimination in the acoustic model creating process employs the cepstrum 
mean normalization method. When a speech recognition is actually 
performed, the type of a noise superposed is determined using the 
feature analysis data of the noise segment, a most suitable acoustic 
model is selected based on the result of the determination, the noise is 
eliminated from the speech data to be recognized on which the noise has 
been superposed by the cepstrum mean normalization method, and a speech 
recognition is performed on the feature vector obtained by the noise 
elimination based on the selected acoustic model. 

By employing the cepstrum mean normalization method for noise 
elimination, noise elimination can be performed with a small amount of 
operations feasible for a CPU with a relatively low operation 
capability. 

This allows implementation in small-scale inexpensive hardware. 
Furthermore, because the cepstrum mean normalization method is believed 
to be effective in eliminating noise such as distortions due to 
microphone characteristics and spatial transmission characteristics 
including echo (generally referred to as multiplicative noise), it is 
highly advantageous when applied to devices typically used in an 
environment where such noise is likely to be present. 

Furthermore, in addition to the noise types, the acoustic models 
may be created for different S/N ratios for each of the noise types, so 
that when a speech recognition is actually performed, the S/N ratio of 
the noise superposed on the speech data to be recognized is obtained 
from the power of the noise segment and the power of the speech segment, 
and an acoustic model is selected in accordance with the S/N ratio and 
the noise type, allowing recognition based on an acoustic model that is 
in accordance with the power of the noise as well as the noise type. 
Accordingly, a high recognition rate can be achieved when a speech 
recognition is performed in environments where each of the noises exists. 

Furthermore, the acoustic models may be created using both the 
spectral subtraction method or the continuous spectral subtraction 
method and the cepstrum mean normalization method. In this case, when a 
speech recognition is actually performed, noise is eliminated by the 
spectral subtraction method or the continuous spectral subtraction 
method, and the feature vector of the speech data which has undergone 
the noise elimination is generated by the cepstrum mean normalization 
method and the feature vector is supplied to the speech recognition unit 
for speech recognition, achieving a high accuracy of recognition and 
allowing compatibility with a wide range of noise including 
the additive noise and the multiplicative noise described earlier. 

Furthermore, the acoustic models may be created corresponding to a 
plurality of S/N ratios for a predetermined noise, so that the S/N ratio 
of the noise superposed on speech data to be recognized is determined, 
an acoustic model is selected in accordance with the S/N ratio, the 
noise is eliminated from the speech data to be recognized on which the 
noise has been superposed by the predetermined noise elimination method, 
and a speech recognition is performed on the feature vector of the 
speech data which has undergone the noise elimination based on the 
selected acoustic model. 

This is advantageous in performing a speech recognition in an 
environment where a particular type of noise exists and the magnitude 
(S/N ratio) tends to vary, achieving a high recognition rate in such an 
environment. In this case, because the type of noise has been 
identified, the noise type need not be determined, reducing the 
amount of operations to a level feasible for a CPU with a 
relatively low operation capability. 

[Brief Description of the Drawings] 
[Fig. 1] 

Construction diagram for explaining a speech recognition apparatus 
according to a first embodiment of the present invention. 
[Fig. 2] 

Flowchart for explaining a processing procedure in the first 
embodiment. 
[Fig. 3] 

Construction diagram for explaining a speech recognition apparatus 
according to a second embodiment of the present invention. 
[Fig. 4] 

Flowchart for explaining a processing procedure in the second 
embodiment. 
[Fig. 5] 

Construction diagram for explaining a speech recognition apparatus 
according to a third embodiment of the present invention. 
[Fig. 6] 

Flowchart for explaining a processing procedure in the third 
embodiment. 

[Reference Numerals] 

1: speech input unit 

2: input speech processing unit 

3: first speech feature analysis unit 

4: noise segment/speech segment determination unit 

5: feature data storage unit 

6: noise type determination/acoustic model selection unit 

7: acoustic model storage unit 

8: noise elimination unit 

9: second speech feature analysis unit 

10: speech recognition unit 

11: language model storage unit 

21: speech feature analysis unit 

31: CMN operation unit (noise elimination unit by the CMN method) 


